Demographic Patterns in South Sudan: A Tidyverse Exploration

A Beginner’s Guide to Data Analysis with R and the Tidyverse

R
Tidyverse
ggplot2
Demographics
Data Visualization
Learn data analysis step-by-step using R’s tidyverse!
Author

Alierwai Reng

Published

November 20, 2024


Introduction

Learn data analysis step-by-step using R’s tidyverse! This beginner-friendly tutorial explores South Sudan’s demographics with clear explanations, beautiful visualizations, and professional tables.

Welcome to this hands-on data analysis tutorial! This guide showcases dplyr 1.2.0 for powerful data manipulation—including the new filter_out() and recode_values() functions—and introduces key stringr 1.6.0 functions for cleaning and transforming text data, including the new case conversion trio: str_to_camel(), str_to_snake(), and str_to_kebab().

By the end of this guide, you’ll understand how to:

  • Load and explore real-world census data
  • Clean and transform data using tidyverse functions
  • Calculate summary statistics and group-level metrics
  • Create beautiful visualizations with ggplot2
  • Build professional tables with gt

We will analyze South Sudan’s 2008 census data as a practical case study; however, the analytical techniques and workflows you will learn are fully transferable to any dataset across domains and contexts.

The data were obtained from the National Bureau of Statistics, South Sudan, via the Open Data for Africa platform: Population by Age and Sex (2008) — http://southsudan.opendataforafrica.org/fvjqdpe/population-by-age-and-sex-2008-south-sudan

Tip: What is the Tidyverse?

The tidyverse is a collection of R packages designed to work together seamlessly for data science. Think of it as a complete toolkit where all the tools fit perfectly in your hand. Key packages include:

  • dplyr: data manipulation (filter, mutate, select, group, summarize); the focus of this tutorial
  • stringr: string manipulation (clean, extract, transform text); introduced throughout this tutorial
  • ggplot2: data visualization (charts and graphs)
  • tidyr: data tidying (reshape and clean)
  • readr: data import (CSV, TSV, and other delimited text files)

These packages share a common design philosophy, making your code readable and your workflow intuitive!

Note: Tested With

R 4.4.x, dplyr 1.2.0, stringr 1.6.0, ggplot2 3.5.x, gt 0.11.x.

Important: What’s New in dplyr 1.2.0 & stringr 1.6.0

dplyr 1.2.0 (released February 2026) introduces powerful new tools:

  • filter_out() — the missing complement to filter(). Drop rows instead of keeping them, with cleaner boolean logic.
  • recode_values() — create entirely new columns by mapping old values to new values. Replaces case_match() with a cleaner formula or from/to interface.
  • replace_values() — partially update an existing column while preserving its type.
  • replace_when() — conditionally replace rows within columns, a type-stable alternative to if_else().
  • when_any() and when_all() — elementwise OR/AND helpers for multi-column conditions.

stringr 1.6.0 (released November 2025) adds:

  • str_to_camel(), str_to_snake(), str_to_kebab() — convert between programming case conventions.
  • str_ilike() — case-insensitive SQL-like pattern matching.

We’ll showcase several of these throughout this tutorial!
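
Because these functions are recent, it may help to check what your installation provides before running the later chunks. The following is a minimal sketch using only base R; the helper name has_export() is made up for illustration and is not part of any package.

```r
# Hypothetical helper: TRUE if `fn` is exported by the installed package `pkg`
has_export <- function(pkg, fn) {
  requireNamespace(pkg, quietly = TRUE) && fn %in% getNamespaceExports(pkg)
}

# Functions this tutorial relies on, grouped by package
needed <- list(
  dplyr   = c("filter_out", "recode_values"),
  stringr = c("str_to_camel", "str_to_snake", "str_to_kebab")
)

# Report availability for each function
for (pkg in names(needed)) {
  for (fn in needed[[pkg]]) {
    status <- if (has_export(pkg, fn)) "available" else "missing (update the package)"
    cat(sprintf("%s::%s: %s\n", pkg, fn, status))
  }
}
```

If a function is reported as missing, updating the corresponding package is the likely fix.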


Part 1: Environment Setup

Step 1: Load Required Packages

Every R analysis starts by loading the libraries (packages) we need. Think of packages as apps on your phone—each one adds specific capabilities.

# Core tidyverse packages (loads dplyr, ggplot2, tidyr, readr, and more!)
library(tidyverse)      
library(janitor)

# Table formatting packages
library(gt)             # Grammar of Tables - for beautiful tables
library(gtExtras)       # Extra features for gt tables

# Visualization enhancement packages
library(ggtext)         # Rich text formatting in ggplot2
library(scales)         # Scale functions for axes and labels
library(glue)           # Easy string interpolation

# Confirmation message
cat("✅ All packages loaded successfully!\n")
✅ All packages loaded successfully!
cat("📦 Tidyverse version:", as.character(packageVersion("tidyverse")), "\n")
📦 Tidyverse version: 2.0.0 
Note: Package Installation

If needed, install with: install.packages(c("tidyverse", "janitor", "gt", "gtExtras", "ggtext", "scales", "glue"))
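
To avoid reinstalling packages you already have, one option is to install only the missing ones. This is a sketch in base R that assumes the package list from Step 1; the install step only runs in an interactive session.

```r
# Packages loaded in Step 1 (janitor included)
pkgs <- c("tidyverse", "janitor", "gt", "gtExtras", "ggtext", "scales", "glue")

# Keep only the ones that are not installed yet
to_install <- setdiff(pkgs, rownames(installed.packages()))

if (length(to_install) == 0) {
  cat("All packages are already installed.\n")
} else if (interactive()) {
  install.packages(to_install)
} else {
  cat("Missing packages:", paste(to_install, collapse = ", "), "\n")
}
```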

Step 2: Configure Visualization Theme

theme_set(
  theme_minimal(base_size = 13, base_family = "sans") +
    theme(
      plot.title = element_markdown(size = 16, face = "bold", color = "#06b6d4", margin = margin(b = 10)),
      plot.subtitle = element_markdown(size = 12, color = "#666666", margin = margin(b = 15)),
      plot.caption = element_markdown(size = 9, color = "#999999", hjust = 0),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "#e5e5e5", linewidth = 0.3),
      legend.position = "top",
      legend.title = element_text(face = "bold", size = 11),
      axis.title = element_text(face = "bold", size = 11)
    )
)

cat("🎨 Custom theme configured!\n")
🎨 Custom theme configured!

Part 2: Loading and Exploring Data

Step 3: Load Census Data from URL

url <- "https://raw.githubusercontent.com/tongakuot/r_tutorials/main/06-data-wrangling/00-input/ss_2008_census_data_raw.csv"

census_raw <- read_csv(url, show_col_types = FALSE)

cat("✅ Data loaded successfully!\n")
✅ Data loaded successfully!
cat("📋 Dimensions:", nrow(census_raw), "rows ×", ncol(census_raw), "columns\n")
📋 Dimensions: 453 rows × 10 columns

Step 4: Examine the Data Structure

glimpse(census_raw)
Rows: 453
Columns: 10
$ Region              <chr> "KN.A2", "KN.A2", "KN.A2", "KN.A2", "KN.A2", "KN.A…
$ `Region Name`       <chr> "Upper Nile", "Upper Nile", "Upper Nile", "Upper N…
$ `Region - RegionId` <chr> "SS-NU", "SS-NU", "SS-NU", "SS-NU", "SS-NU", "SS-N…
$ Variable            <chr> "KN.B2", "KN.B2", "KN.B2", "KN.B2", "KN.B2", "KN.B…
$ `Variable Name`     <chr> "Population, Total (Number)", "Population, Total (…
$ Age                 <chr> "KN.C1", "KN.C2", "KN.C3", "KN.C4", "KN.C5", "KN.C…
$ `Age Name`          <chr> "Total", "0 to 4", "5 to 9", "10 to 14", "15 to 19…
$ Scale               <chr> "units", "units", "units", "units", "units", "unit…
$ Units               <chr> "Persons", "Persons", "Persons", "Persons", "Perso…
$ `2008`              <dbl> 964353, 150872, 151467, 126140, 103804, 82588, 767…

Step 5: Preview the Data

# head() shows the first n rows
census_raw |>
  head(10)
# A tibble: 10 × 10
   Region `Region Name` `Region - RegionId` Variable `Variable Name`       Age  
   <chr>  <chr>         <chr>               <chr>    <chr>                 <chr>
 1 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C1
 2 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C2
 3 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C3
 4 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C4
 5 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C5
 6 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C6
 7 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C7
 8 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C8
 9 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C9
10 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C…
# ℹ 4 more variables: `Age Name` <chr>, Scale <chr>, Units <chr>, `2008` <dbl>
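
Before cleaning, it is often useful to count missing values per column, since rows with a missing population are removed in Step 6. The sketch below uses a small made-up tibble (toy_raw, not census figures) as a stand-in for census_raw; the same summarise() call could be pointed at the real data.

```r
library(dplyr)
library(tibble)

# Toy stand-in for census_raw (values are illustrative, not census figures)
toy_raw <- tibble(
  `Region Name` = c("Upper Nile", "Upper Nile", NA),
  `Age Name`    = c("Total", "0 to 4", NA),
  `2008`        = c(100, 40, NA)
)

# Count missing values in every column
na_counts <- toy_raw |>
  summarise(across(everything(), \(x) sum(is.na(x))))

na_counts
```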

Part 3: Data Cleaning

Raw data is rarely analysis-ready. We need to clean and standardize it first!

Note: Introducing stringr for Text Cleaning

The stringr package (part of tidyverse) provides consistent, intuitive functions for string manipulation. In this section, we’ll use several key functions:

Function             Purpose                    Example
str_to_lower()       Convert to lowercase       “HELLO” → “hello”
str_to_upper()       Convert to UPPERCASE       “hello” → “HELLO”
str_to_title()       Convert to Title Case      “hello world” → “Hello World”
str_to_sentence()    Sentence case              “hello world” → “Hello world”
str_to_camel()       Convert to camelCase       “quick brown fox” → “quickBrownFox”
str_to_snake()       Convert to snake_case      “Quick Brown Fox” → “quick_brown_fox”
str_to_kebab()       Convert to kebab-case      “Quick Brown Fox” → “quick-brown-fox”
str_squish()         Remove extra whitespace    “  hello   world  ” → “hello world”
str_replace_all()    Replace patterns           “a b c” → “a_b_c”

The first seven are case conversion functions. The trio str_to_camel(), str_to_snake(), and str_to_kebab() are new in stringr 1.6.0 and convert between programming naming conventions—essential when bridging R data with Python, JavaScript, or API outputs.
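
As a quick illustration of the long-standing functions in the table, here is a small sketch on a made-up string. It deliberately avoids the stringr 1.6.0 additions so that it also runs on older installations.

```r
library(stringr)

messy <- "  upper   NILE  state "

squished <- str_squish(messy)           # "upper NILE state"
titled   <- str_to_title(squished)      # "Upper Nile State"
sentence <- str_to_sentence(squished)   # "Upper nile state"

# Lowercase first, then swap spaces for underscores
snake <- str_replace_all(str_to_lower(squished), " ", "_")  # "upper_nile_state"
```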

stringr 1.6.0: Case Conversion in Action

Before stringr 1.6.0, converting between camelCase, snake_case, and kebab-case required manual regex or external packages like snakecase. Now these conversions are built in:

# New in stringr 1.6.0: convert between programming case conventions
demo_text <- "south sudan census data"

cat("Original:    ", demo_text, "\n")
cat("camelCase:   ", str_to_camel(demo_text), "\n")
cat("PascalCase:  ", str_to_camel(demo_text, first_upper = TRUE), "\n")
cat("snake_case:  ", str_to_snake(demo_text), "\n")
cat("kebab-case:  ", str_to_kebab(demo_text), "\n")
cat("Title Case:  ", str_to_title(demo_text), "\n")
Important: Transition from Old to New Case Conversion

Before stringr 1.6.0 — manual regex or external package:

# Required snakecase package or manual work
snakecase::to_snake_case("South Sudan Census")
gsub(" ", "-", tolower("South Sudan Census"))

After stringr 1.6.0 — native, consistent, pipe-friendly:

# Built into stringr — works seamlessly in tidyverse pipelines
"South Sudan Census" |> str_to_snake()   # "south_sudan_census"
"South Sudan Census" |> str_to_kebab()   # "south-sudan-census"
"South Sudan Census" |> str_to_camel()   # "southSudanCensus"

These are particularly valuable when bridging R data with Python (snake_case), JavaScript (camelCase), or URL slugs (kebab-case).

Step 6: Clean and Transform the Dataset

Here’s a practical cleaning pipeline using the janitor package for automatic name standardization:

census_clean <-
  census_raw |>

  # Standardize column names automatically
  clean_names() |>

  # Keep only the columns we need, renaming them as we go
  select(
    state = region_name,
    gender = variable_name,
    age_category = age_name,
    population = x2008
  ) |>

  # Clean text columns and convert population to integer
  mutate(
    across(where(is.character), \(x) str_squish(x) |> str_to_title()),
    population = as.integer(population)
  ) |>

  # Remove rows with missing or invalid data
  filter(!is.na(population), population > 0)

# Display results
cat("✅ Data cleaning complete!\n")
✅ Data cleaning complete!
cat("📊 Cleaned dataset:", nrow(census_clean), "rows ×", ncol(census_clean), "columns\n")
📊 Cleaned dataset: 450 rows × 4 columns
Tip: The Cleaning Pipeline Explained

Key steps:

  • clean_names(): Automatically standardizes column names (lowercase, underscores, safe for R)
  • select(): Chooses columns and renames them simultaneously
  • across(where(is.character), ...): Applies the same transformation to all text columns at once
  • str_squish() + str_to_title(): Removes extra spaces and capitalizes properly
  • as.integer(): Converts population to whole numbers (appropriate for count data)
  • filter(): Removes rows with missing or invalid population values

This single pipeline combines efficiency with clarity!
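
To see the individual steps on something small, here is a sketch with a two-row toy tibble. The column names mimic the raw file, and the second row is invented to show the NA filter.

```r
library(dplyr)
library(tibble)
library(stringr)
library(janitor)

# Toy input: messy names, messy text, one missing value
toy <- tibble(
  `Region Name` = c("  upper   nile ", "JONGLEI"),
  `2008`        = c(964353, NA)
)

toy_clean <- toy |>
  clean_names() |>   # `Region Name` -> region_name, `2008` -> x2008
  mutate(across(where(is.character), \(x) str_to_title(str_squish(x)))) |>
  filter(!is.na(x2008))   # drops the row without a population value

toy_clean
```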

Step 7: Preview the Cleaned Data

census_clean |>
  head(10)
# A tibble: 10 × 4
   state      gender                     age_category population
   <chr>      <chr>                      <chr>             <int>
 1 Upper Nile Population, Total (Number) Total            964353
 2 Upper Nile Population, Total (Number) 0 To 4           150872
 3 Upper Nile Population, Total (Number) 5 To 9           151467
 4 Upper Nile Population, Total (Number) 10 To 14         126140
 5 Upper Nile Population, Total (Number) 15 To 19         103804
 6 Upper Nile Population, Total (Number) 20 To 24          82588
 7 Upper Nile Population, Total (Number) 25 To 29          76754
 8 Upper Nile Population, Total (Number) 30 To 34          63134
 9 Upper Nile Population, Total (Number) 35 To 39          56806
10 Upper Nile Population, Total (Number) 40 To 44          42139

Part 3B: Advanced String Processing

Now we need to extract more meaningful information from our data. The gender column actually contains structured text like “Population, Male (Number)” that we can parse!

Step 8: Examine the Gender Column Structure

# See unique values in gender column
cat("🔍 Unique values in gender column:\n")
🔍 Unique values in gender column:
census_clean |>
  distinct(gender) |>
  pull(gender)
[1] "Population, Total (Number)"  "Population, Male (Number)"  
[3] "Population, Female (Number)"

The gender column follows the pattern “Population, Gender (Type)”, and we need to extract just the gender value.

Step 9: Extract Gender Information

The gender column contains structured text like “Population, Male (Number)”. We’ll extract just the gender value using the most practical approaches.

Method 1: Using str_split_i() — Simple and Direct

Split the text and extract the piece you need:

# Apply gender extraction to our dataset
census_parsed <- census_clean |>
  mutate(
    # Split on " " and take the 2nd piece; the next two steps are safeguards against leftover "(...)" text or whitespace
    gender = str_split_i(gender, " ", 2) |>
             str_remove(" \\(.*\\)") |>
             str_squish()
  )

# Verify extraction worked
cat("✅ Gender extraction complete!\n")
✅ Gender extraction complete!
cat("🎯 Unique gender values:\n")
🎯 Unique gender values:
census_parsed |>
  distinct(gender) |>
  pull(gender)
[1] "Total"  "Male"   "Female"
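Tip: How This Works
  • str_split_i(gender, " ", 2): Split on spaces, extract the 2nd piece
    • “Population, Male (Number)” splits into “Population,”, “Male”, and “(Number)”, so the 2nd piece is “Male”
  • str_remove(" \\(.*\\)"): Remove any “ (anything)” pattern
    • Nothing is left to remove here, because the split already dropped “(Number)”; the step guards against labels where the parenthetical stays attached
  • str_squish(): Clean extra whitespace
    • “Male” stays “Male” for this data

For a single-piece extraction from a fixed label format like this one, str_split_i() is the most direct approach.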
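
To trace the split on one actual label from Step 8, the short sketch below shows every piece that str_split() produces and how str_split_i() picks one by position. Negative positions count from the right.

```r
library(stringr)

x <- "Population, Male (Number)"

pieces <- str_split(x, " ")[[1]]    # "Population," "Male" "(Number)"
second <- str_split_i(x, " ", 2)    # "Male"
last   <- str_split_i(x, " ", -1)   # "(Number)"
```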

Method 2: Using separate_wider_delim() — When You Need Multiple Pieces

If you need to keep multiple pieces from a split, use separate_wider_delim():

# Demonstrate separate_wider_delim()
demo_separate <- census_clean |>
  select(original_gender = gender) |>
  separate_wider_delim(
    cols = original_gender,
    delim = " ",
    names = c("prefix", "gender_raw", "suffix"),
    too_few = "align_start",
    cols_remove = FALSE
  ) |>
  mutate(gender_clean = str_remove(gender_raw, " \\(.*\\)")) |>
  select(original_gender, prefix, gender_raw, gender_clean) |>
  distinct() |>
  head(3)

cat("✅ separate_wider_delim() keeps all pieces:\n")
✅ separate_wider_delim() keeps all pieces:
demo_separate
# A tibble: 3 × 4
  original_gender             prefix      gender_raw gender_clean
  <chr>                       <chr>       <chr>      <chr>       
1 Population, Total (Number)  Population, Total      Total       
2 Population, Male (Number)   Population, Male       Male        
3 Population, Female (Number) Population, Female     Female      
Note: When to Use Each Method

str_split_i() (Method 1):

  • You only need one piece
  • More concise code
  • Recommended for this task

separate_wider_delim() (Method 2):

  • You need multiple pieces as separate columns
  • Better for data reshaping workflows
  • Excellent when working with structured delimited text

For our case: We use Method 1 because we only extract the gender value.

Method 3: Regular Expressions — For Complex Patterns

For more complex text patterns, use regex with str_extract():

# Regex approach (for reference/learning)
demo_regex <- census_clean |>
  select(original_gender = gender) |>
  mutate(
    # Regex: text between " " and " ("
    gender_regex = str_extract(original_gender, "(?<= ).*(?= \\()")
  ) |>
  distinct() |>
  head(3)

cat("✅ Regex extraction with lookaround assertions:\n")
✅ Regex extraction with lookaround assertions:
demo_regex
# A tibble: 3 × 2
  original_gender             gender_regex
  <chr>                       <chr>       
1 Population, Total (Number)  Total       
2 Population, Male (Number)   Male        
3 Population, Female (Number) Female      
Note: Regex Lookaround Basics

"(?<= ).*(?= \\()" extracts text between two patterns:

  • (?<= ) is a lookbehind: the match must be preceded by a space
  • .* is the match itself: any characters
  • (?= \\() is a lookahead: the match must be followed by “ (”

Result: Extracts “Male” from “Population, Male (Number)”

Additional approaches (regex lookahead, separate + unite, etc.) are covered in the stringr documentation.
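
If lookarounds feel opaque, a capture group is a common alternative. The sketch below compares the two on the three labels from Step 8. The second pattern is one possible choice and assumes the gender value is a single word.

```r
library(stringr)

labels <- c(
  "Population, Total (Number)",
  "Population, Male (Number)",
  "Population, Female (Number)"
)

# Lookaround version from Method 3
via_lookaround <- str_extract(labels, "(?<= ).*(?= \\()")

# Capture-group version: column 2 of str_match() holds the first group
via_group <- str_match(labels, ", (\\w+) \\(")[, 2]

via_lookaround
via_group
```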

Step 10: Recategorize Age Groups

Let’s group the fine-grained 5-year age bands into broader, more interpretable categories:

cat("🔍 Current age categories:\n")
🔍 Current age categories:
census_parsed |>
  distinct(age_category) |>
  arrange(age_category) |>
  pull(age_category)
 [1] "0 To 4"   "10 To 14" "15 To 19" "20 To 24" "25 To 29" "30 To 34"
 [7] "35 To 39" "40 To 44" "45 To 49" "5 To 9"   "50 To 54" "55 To 59"
[13] "60 To 64" "65+"      "Total"   

Method 1: Using case_when() for Conditional Recategorization

The case_when() function is perfect for complex, multi-condition transformations:

census_final <- census_parsed |>
  mutate(
    age_category  = str_to_lower(age_category),
    age_category = case_when(
      # Children (0-14)
      age_category %in% c("0 to 4", "5 to 9", "10 to 14") ~ "0-14",
      
      # Youth (15-24)
      age_category %in% c("15 to 19", "20 to 24") ~ "15-24",
      
      # Early working age (25-34)
      age_category %in% c("25 to 29", "30 to 34") ~ "25-34",
      
      # Middle working age (35-44)
      age_category %in% c("35 to 39", "40 to 44") ~ "35-44",
      
      # Later working age (45-54)
      age_category %in% c("45 to 49", "50 to 54") ~ "45-54",
      
      # Pre-retirement (55-64)
      age_category %in% c("55 to 59", "60 to 64") ~ "55-64",
      
      # Retirement age (65+)
      age_category == "65+" ~ "65+",
      
      # Catch any unexpected values
      TRUE ~ age_category
    )
  )

# Verify recategorization
cat("✅ Age categories recategorized!\n")
✅ Age categories recategorized!
cat("🎯 New age categories:\n")
🎯 New age categories:
census_final |> 
  distinct(age_category) |> 
  arrange(age_category) |>
  pull(age_category)
[1] "0-14"  "15-24" "25-34" "35-44" "45-54" "55-64" "65+"   "total"
Important: Understanding case_when()

case_when() is like a multi-way IF statement (similar to SQL’s CASE WHEN):

Structure:

case_when(
  condition1 ~ result1,  # If condition1 is TRUE, return result1
  condition2 ~ result2,  # Else if condition2 is TRUE, return result2
  condition3 ~ result3,  # Else if condition3 is TRUE, return result3
  TRUE ~ default         # Else return default (catch-all)
)

Key points:

  • Conditions are evaluated in order (first match wins)
  • %in% checks if a value is in a vector (like “is one of”)
  • ~ separates condition from result
  • TRUE ~ at the end catches anything not matched above

Example for our data:

  • If age is “0 to 4” OR “5 to 9” OR “10 to 14” → return “0-14”
  • Else if age is “15 to 19” OR “20 to 24” → return “15-24”
  • And so on…
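
The “first match wins” rule is easiest to see with overlapping conditions. In this toy sketch the numeric ages are invented for illustration, since the census data stores age bands as text.

```r
library(dplyr)

age <- c(3, 12, 17, 40)

# Conditions are checked top to bottom; the first TRUE one decides the result
ordered <- case_when(
  age < 15 ~ "0-14",
  age < 25 ~ "15-24",
  TRUE     ~ "25+"
)

# Same conditions with the broad one first: it shadows the narrower one
shadowed <- case_when(
  age < 25 ~ "15-24",
  age < 15 ~ "0-14",
  TRUE     ~ "25+"
)

ordered    # "0-14" "0-14" "15-24" "25+"
shadowed   # "15-24" "15-24" "15-24" "25+"
```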

Method 2: Using recode_values() — New in dplyr 1.2.0

For value-to-value mappings, the new recode_values() is cleaner than case_when(). It accepts a from/to lookup table or formula syntax:

# Build a lookup table — clean and portable
age_lookup <- tribble(
  ~from,       ~to,
  "0 to 4",    "0-14",
  "5 to 9",    "0-14",
  "10 to 14",  "0-14",
  "15 to 19",  "15-24",
  "20 to 24",  "15-24",
  "25 to 29",  "25-34",
  "30 to 34",  "25-34",
  "35 to 39",  "35-44",
  "40 to 44",  "35-44",
  "45 to 49",  "45-54",
  "50 to 54",  "45-54",
  "55 to 59",  "55-64",
  "60 to 64",  "55-64",
  "65+",       "65+"
)

# Demonstrate recode_values() (not applied to dataset)
demo_recode <- census_parsed |>
  select(age_category) |>
  mutate(
    age_category = str_to_lower(age_category),
    age_recoded = recode_values(
      age_category,
      from = age_lookup$from,
      to = age_lookup$to
    )
  ) |>
  distinct()

cat("✅ recode_values() demonstration:\n")
✅ recode_values() demonstration:
demo_recode
# A tibble: 15 × 2
   age_category age_recoded
   <chr>        <chr>      
 1 total        <NA>       
 2 0 to 4       0-14       
 3 5 to 9       0-14       
 4 10 to 14     0-14       
 5 15 to 19     15-24      
 6 20 to 24     15-24      
 7 25 to 29     25-34      
 8 30 to 34     25-34      
 9 35 to 39     35-44      
10 40 to 44     35-44      
11 45 to 49     45-54      
12 50 to 54     45-54      
13 55 to 59     55-64      
14 60 to 64     55-64      
15 65+          65+        
Tip: Comparing case_when() vs recode_values()

Use case_when() when:

  • Multiple conditions per category (AND/OR logic)
  • Complex conditional expressions
  • Best when conditions aren’t simple equality checks

Use recode_values() (dplyr 1.2.0) when:

  • Simple value-to-value mappings
  • Lookup table stored externally (CSV, tribble)
  • Cleaner syntax for 1-to-1 replacements
  • Replaces the older recode() and case_match()

Both work for our use case. We use case_when() above because it groups related age bands together, making the logic visible. For production code with many mappings, recode_values() with a lookup table is more maintainable.
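
On installations without dplyr 1.2.0, the lookup-table idea can be approximated with a named vector in base R. This is a substitute technique, not what recode_values() does internally, and the map below is shortened for illustration. Values without an entry come back as NA.

```r
# Named vector: names are the old values, elements are the new ones
age_map <- c(
  "0 to 4"   = "0-14",
  "5 to 9"   = "0-14",
  "10 to 14" = "0-14",
  "65+"      = "65+"
)

x <- c("0 to 4", "10 to 14", "65+", "total")

# Index by name; unname() drops the old values from the result
recoded <- unname(age_map[x])

recoded   # "0-14" "0-14" "65+" NA
```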

Important: Transition from Old to New Recoding Functions

Superseded — recode() (avoid in new code):

recode(x, "0 to 4" = "0-14", "5 to 9" = "0-14", .default = x)

Soft-deprecated — case_match() (migrate to recode_values()):

case_match(x, c("0 to 4", "5 to 9", "10 to 14") ~ "0-14", .default = x)

New — recode_values() with formula syntax:

recode_values(x, "0 to 4" ~ "0-14", "5 to 9" ~ "0-14", default = x)

New — recode_values() with lookup table (recommended for many mappings):

recode_values(x, from = age_lookup$from, to = age_lookup$to)

New — replace_values() for partial updates (preserves column type):

replace_values(x, "Total" ~ "All Genders")

The recode_values() function replaces both recode() and case_match(). Use replace_values() when you only need to change a few values while keeping the rest intact.

Step 11: Filter and Verify Final Dataset

Remove aggregate rows before analysis. In dplyr 1.2.0, the new filter_out() makes this intent explicit—you specify what to drop rather than negate conditions:

# Remove rows where gender is "Total" or age_category is "total"
# (age_category was lowercased in Step 10, so its aggregate label is "total")
census_filtered <- census_final |>
  filter_out(gender == "Total" | age_category == "total")

census_filtered
# A tibble: 280 × 4
   state      gender age_category population
   <chr>      <chr>  <chr>             <int>
 1 Upper Nile Male   0-14              82690
 2 Upper Nile Male   0-14              83744
 3 Upper Nile Male   0-14              71027
 4 Upper Nile Male   15-24             57387
 5 Upper Nile Male   15-24             42521
 6 Upper Nile Male   25-34             38795
 7 Upper Nile Male   25-34             32236
 8 Upper Nile Male   35-44             30228
 9 Upper Nile Male   35-44             22290
10 Upper Nile Male   45-54             18163
# ℹ 270 more rows
Important: Transition from Old to New Row Filtering

Before dplyr 1.2.0 — negated logic with filter():

# Negated conditions: "keep rows where gender is NOT Total AND age is NOT total"
census_final |> filter(gender != "Total", age_category != "total")

After dplyr 1.2.0 — direct intent with filter_out():

# Clear intent — "drop rows where gender IS Total OR age IS Total"
census_final |> filter_out(gender == "Total" | age_category == "total")

Why this matters:

  • filter() keeps rows that match → forces negated logic to drop rows (!=, !)
  • filter_out() drops rows that match → write positive conditions, cleaner boolean logic
  • filter_out() handles NA values more predictably (rows with NA are kept, not silently dropped)

Bonus — when_any() and when_all() for multi-column conditions:

# Drop rows where ANY of the conditions is TRUE (elementwise OR across the inputs)
census_final |> filter_out(when_any(gender == "Total", age_category == "total"))

These helpers compose naturally with both filter() and filter_out(), making multi-column conditions readable.
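
The NA point is worth seeing in action. The sketch below uses only filter(), so it runs on older dplyr versions too, and shows how a negated condition silently removes rows whose value is missing. The toy tibble is invented for illustration.

```r
library(dplyr)
library(tibble)

toy <- tibble(
  gender     = c("Male", "Total", NA),
  population = c(10L, 20L, 5L)
)

# Negated filter(): the NA row disappears along with "Total"
dropped_na <- toy |> filter(gender != "Total")

# Keeping the NA row with filter() needs an explicit is.na() clause
kept_na <- toy |> filter(is.na(gender) | gender != "Total")

nrow(dropped_na)   # 1
nrow(kept_na)      # 2
```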

Let’s confirm all transformations worked correctly:

# Generate a comprehensive summary of the cleaned and transformed dataset
# This verification step ensures all transformations were applied correctly

cat(strrep("=", 50), "\n", sep = "")
==================================================
cat("🎉 DATA TRANSFORMATION COMPLETE!\n")
🎉 DATA TRANSFORMATION COMPLETE!
cat(strrep("=", 50), "\n", sep = "")
==================================================
# Display dimensions of the final cleaned dataset
cat("📊 Final dataset dimensions:\n")
📊 Final dataset dimensions:
cat("  Rows:", nrow(census_filtered), "\n")
  Rows: 280 
cat("  Columns:", ncol(census_filtered), "\n\n")
  Columns: 4 
# List all column names in the cleaned dataset
cat("✅ Column names:\n")
✅ Column names:
cat("  ", paste(names(census_filtered), collapse = ", "), "\n\n")
   state, gender, age_category, population 
# Show all unique gender categories (only "Male" and "Female" should remain after filtering)
cat("🎯 Unique gender values:\n")
🎯 Unique gender values:
census_filtered |> distinct(gender) |> pull(gender) |> cat("  ", "\n")
Male Female    
# Show all unique age categories in sorted order
cat("\n🎯 Unique age categories:\n")

🎯 Unique age categories:
census_filtered |> distinct(age_category) |> arrange(age_category) |> pull(age_category) |> cat("  ", "\n")
0-14 15-24 25-34 35-44 45-54 55-64 65+    
# Display first 10 rows of key columns to visually inspect the data
cat("\n📋 Sample of final data:\n")

📋 Sample of final data:
census_filtered |> 
  select(state, gender, age_category, population) |>
  head(10)
# A tibble: 10 × 4
   state      gender age_category population
   <chr>      <chr>  <chr>             <int>
 1 Upper Nile Male   0-14              82690
 2 Upper Nile Male   0-14              83744
 3 Upper Nile Male   0-14              71027
 4 Upper Nile Male   15-24             57387
 5 Upper Nile Male   15-24             42521
 6 Upper Nile Male   25-34             38795
 7 Upper Nile Male   25-34             32236
 8 Upper Nile Male   35-44             30228
 9 Upper Nile Male   35-44             22290
10 Upper Nile Male   45-54             18163
Note: Data Transformation Summary

What we accomplished:

Original gender column values:

  • “Population, Male (Number)”
  • “Population, Female (Number)”
  • “Population, Total (Number)”

Transformed to clean format:

  • “Male”
  • “Female”
  • “Total”

Original age_category column structure:

  • 14 individual age bands (thirteen 5-year bands plus “65+”), alongside a “Total” row
  • Examples: “0 To 4”, “5 To 9”, “10 To 14”, …, “65+”

Transformed to standardized life-stage categories:

  • 7 broader, more interpretable age groups
  • “0-14”, “15-24”, “25-34”, “35-44”, “45-54”, “55-64”, “65+”

Result: A clean, standardized dataset ready for analysis and visualization! ✨
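
Before discarding aggregate rows, it can be reassuring to confirm that each “Total” equals the sum of its parts. This is a sketch on toy numbers (not census figures); a similar check could be adapted to the real data by grouping on state and age_category.

```r
library(dplyr)
library(tibble)

# Toy rows for one state (illustrative numbers)
toy <- tibble(
  state      = "A",
  gender     = c("Total", "Male", "Female"),
  population = c(100L, 52L, 48L)
)

# Compare the reported total with the sum of the component rows
check <- toy |>
  group_by(state) |>
  summarise(
    reported = sum(population[gender == "Total"]),
    computed = sum(population[gender != "Total"]),
    .groups = "drop"
  ) |>
  mutate(matches = reported == computed)

check
```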


Part 4: Data Exploration and Summary

Step 12: Create Overview Statistics

Let’s calculate some key statistics about our dataset:

# Create a summary table
overview_table <- census_filtered |>
  summarise(
    `Total Population` = comma(sum(population)),     # Format with commas
    `Number of States` = n_distinct(state),          # Count unique states
    `Age Categories` = n_distinct(age_category),     # Count unique ages
    `Gender Groups` = n_distinct(gender),            # Count unique genders
    `Total Observations` = comma(n())                # Count all rows
  )

# Display the summary
overview_table
# A tibble: 1 × 5
  `Total Population` `Number of States` `Age Categories` `Gender Groups`
  <chr>                           <int>            <int>           <int>
1 8,260,490                          10                7               2
# ℹ 1 more variable: `Total Observations` <chr>
Note: Understanding summarise()

summarise() collapses data into summary statistics:

  • sum() - adds up values
  • n_distinct() - counts unique values
  • n() - counts total rows
  • comma() - formats numbers with commas (from scales package)

It reduces many rows into one row of summaries!
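
Here is the same idea on a four-row toy tibble, small enough to verify by hand. The data is invented for illustration.

```r
library(dplyr)
library(tibble)

toy <- tibble(
  state      = c("A", "A", "B", "B"),
  population = c(10, 20, 30, 40)
)

# Four rows collapse into a single row of summaries
toy_summary <- toy |>
  summarise(
    total_population = sum(population),     # 100
    n_states         = n_distinct(state),   # 2
    n_rows           = n()                  # 4
  )

toy_summary
```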

Step 13: Display as Professional Table

Now let’s make this summary look professional using the gt package:

overview_table |>
  gt() |>
  tab_header(
    title = md("**South Sudan 2008 Census Overview**"),
    subtitle = "Key Summary Statistics"
  ) |>
  tab_style(
    style = cell_fill(color = "#22d3ee"),
    locations = cells_body()
  ) |>
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_body()
  ) |>
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16),
    heading.subtitle.font.size = px(12)
  )
South Sudan 2008 Census Overview
Key Summary Statistics
Total Population Number of States Age Categories Gender Groups Total Observations
8,260,490 10 7 2 280
Tip: The Grammar of Tables (gt)

The gt package uses a layered approach (like ggplot2 for tables):

  1. Start with data: gt()
  2. Add headers: tab_header()
  3. Style cells: tab_style()
  4. Format numbers: fmt_number()
  5. Adjust options: tab_options()

Each layer adds or modifies the table appearance!
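
The overview table above stores pre-formatted text, so it never needed fmt_number(). When a column is still numeric, the formatting layer can be added as in this minimal sketch. The toy data frame and title are invented for illustration.

```r
library(gt)

toy <- data.frame(
  state      = c("A", "B"),
  population = c(1234567, 89012)
)

toy_table <- toy |>
  gt() |>                                             # 1. start with data
  tab_header(title = "Toy Population Table") |>       # 2. add a header
  fmt_number(columns = population, decimals = 0,
             use_seps = TRUE) |>                      # 4. format numbers
  tab_options(table.font.size = px(13))               # 5. adjust options

toy_table
```

Keeping the column numeric and formatting it in gt, rather than converting it to text with comma() beforehand, leaves the values available for further calculation.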


Part 5: Gender Analysis

Step 14: Calculate National Gender Distribution

Let’s analyze how the population is distributed by gender:

gender_summary <- census_filtered |>
  
  # Step 1: Group data by gender
  group_by(gender) |>
  
  # Step 2: Calculate total population for each gender
  summarise(
    population = sum(population),
    .groups = "drop"  # Remove grouping after summarise
  ) |>
  
  # Step 3: Calculate percentages
  mutate(
    percentage = population / sum(population) * 100,
    percentage_label = percent(percentage / 100, accuracy = 0.01)
  ) |>
  
  # Step 4: Sort by population (largest first)
  arrange(desc(population))

# Display the results
gender_summary
# A tibble: 2 × 4
  gender population percentage percentage_label
  <chr>       <int>      <dbl> <chr>           
1 Male      4287300       51.9 51.90%          
2 Female    3973190       48.1 48.10%          
Important: Understanding group_by() and summarise()

These two functions work together like a team:

group_by(gender)

  • Splits data into groups (one for Male, one for Female)
  • Like separating cards into piles

summarise(population = sum(population))

  • Performs calculations within each group
  • sum() adds up all population values in each group
  • Like counting cards in each pile

.groups = "drop"

  • Removes the grouping after we’re done
  • Prevents unexpected behavior in future operations

Final result: One row per gender with total population!
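
As an aside, dplyr 1.1.0 and later also accept a .by argument that groups for a single operation, so there is nothing to ungroup afterwards. The sketch below compares both styles on invented toy numbers.

```r
library(dplyr)
library(tibble)

toy <- tibble(
  gender     = c("Male", "Male", "Female", "Female"),
  population = c(30, 22, 28, 20)
)

# group_by() + summarise(), as in Step 14
grouped <- toy |>
  group_by(gender) |>
  summarise(population = sum(population), .groups = "drop")

# Per-operation grouping with .by (dplyr 1.1.0 or later)
by_arg <- toy |>
  summarise(population = sum(population), .by = gender)

grouped   # rows sorted by group: Female, then Male
by_arg    # rows in order of first appearance: Male, then Female
```

The totals are the same; only the row order differs, which is worth remembering if later code relies on row positions.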

Step 15: Display Gender Table

gender_summary |>
  # Rename columns for display
  select(
    Gender = gender, 
    Population = population, 
    `Percentage` = percentage_label
  ) |>
  
  # Create gt table
  gt() |>
  
  # Add title and subtitle
  tab_header(
    title = md("**National Gender Distribution**"),
    subtitle = "South Sudan 2008 Census"
  ) |>
  
  # Format population with commas
  fmt_number(
    columns = Population,
    decimals = 0,
    use_seps = TRUE
  ) |>
  
  # Style Male row (first row) - cyan background
  tab_style(
    style = list(
      cell_fill(color = "#22d3ee"),
      cell_text(color = "white", weight = "bold")
    ),
    locations = cells_body(rows = 1)
  ) |>
  
  # Style Female row (second row) - gold background
  tab_style(
    style = list(
      cell_fill(color = "#FFD700"),
      cell_text(color = "#000000", weight = "bold")
    ),
    locations = cells_body(rows = 2)
  ) |>
  
  # Center all columns
  cols_align(align = "center", columns = everything()) |>
  
  # Adjust font sizes
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16)
  )
National Gender Distribution
South Sudan 2008 Census

Gender   Population   Percentage
Male     4,287,300    51.90%
Female   3,973,190    48.10%
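The table code above styles rows by position (rows = 1, rows = 2), which relies on the sort order staying the same. As a hedged alternative, gt also accepts a row condition, so the styling follows the value rather than the position. The sketch below assumes a small stand-in data frame, `gender_display`, with values copied from the output above.

```r
library(gt)

# Hypothetical stand-in for the display table (values copied from the output above)
gender_display <- data.frame(
  Gender     = c("Male", "Female"),
  Population = c(4287300, 3973190)
)

gender_tbl <- gender_display |>
  gt() |>
  fmt_number(columns = Population, decimals = 0, use_seps = TRUE) |>
  # Target each row by its value, so the styling survives a re-sort
  tab_style(
    style = cell_fill(color = "#22d3ee"),
    locations = cells_body(rows = Gender == "Male")
  ) |>
  tab_style(
    style = cell_fill(color = "#FFD700"),
    locations = cells_body(rows = Gender == "Female")
  )

gender_tbl
```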

Step 16: Visualize Gender Distribution

Numbers are great, but visualizations make patterns instantly clear. Let’s create a pie chart:

ggplot(gender_summary, aes(x = "", y = population, fill = gender)) +
  
  # Create a bar chart (we'll turn it into a pie)
  geom_col(width = 1, color = "white", linewidth = 2) +
  
  # Convert bar chart to pie chart using polar coordinates
  coord_polar(theta = "y") +
  
  # Set custom colors for Male and Female
  scale_fill_manual(
    values = c(
      "Male" = "#22d3ee",
      "Female" = "#FFD700"
    )
  ) +
  
  # Add labels showing counts and percentages
  geom_text(
    aes(label = glue("{comma(population)}\n({percentage_label})")),
    position = position_stack(vjust = 0.5),  # Center in each slice
    size = 5,
    fontface = "bold",
    color = "white"
  ) +
  
  # Add titles and labels
  labs(
    title = "**Gender Distribution in South Sudan**",
    subtitle = "2008 Census Data",
    fill = "Gender"
  ) +
  
  # Use void theme for pie charts (removes axes)
  theme_void() +
  
  # Customize title and legend
  theme(
    plot.title = element_markdown(
      size = 16, 
      face = "bold", 
      color = "#06b6d4",
      hjust = 0.5,        # Center title
      margin = margin(b = 5)
    ),
    plot.subtitle = element_markdown(
      size = 12, 
      color = "#666666", 
      hjust = 0.5         # Center subtitle
    ),
    legend.position = "bottom",
    legend.title = element_text(face = "bold", size = 11)
  )

Gender distribution shown as a pie chart with population counts and percentages
Tip: Anatomy of a ggplot2 Chart

Every ggplot2 visualization follows this pattern:

  1. Start with data: ggplot(data, aes(...))
  2. Add geometry: geom_*() (point, line, bar, etc.)
  3. Adjust scales: scale_*() (colors, axes, etc.)
  4. Add labels: labs() (title, axes, etc.)
  5. Apply theme: theme_*() (appearance)

Think of it like building with LEGO blocks—each layer adds something!

Bonus: coord_polar() transforms rectangular plots into circular ones (bar chart → pie chart)!
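Here is a minimal sketch of those five layers on a made-up data frame. The `toy` data and its values are invented for illustration; only the layering pattern matters.

```r
library(ggplot2)

# Hypothetical toy data (numbers invented for illustration)
toy <- data.frame(
  group = c("A", "B", "C"),
  value = c(3, 5, 2)
)

p <- ggplot(toy, aes(x = group, y = value, fill = group)) +  # 1. data + aesthetics
  geom_col() +                                                # 2. geometry
  scale_fill_manual(
    values = c(A = "#22d3ee", B = "#FFD700", C = "#666666")   # 3. scales
  ) +
  labs(title = "Toy example", x = NULL, y = "Value") +        # 4. labels
  theme_minimal()                                             # 5. theme

p                  # an ordinary bar chart
p + coord_polar()  # the same layers drawn in polar coordinates
```

Each `+` adds one piece, so you can comment out a single line to see what that layer contributes.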

Step 17: Gender Distribution by State

Now let’s see how gender distribution varies across different states:

state_gender <- census_filtered |>
  
  # Group by both state AND gender
  group_by(state, gender) |>
  
  # Sum population within each state-gender combination
  summarise(population = sum(population), .groups = "drop") |>
  
  # Reshape from long to wide format
  # Before: Multiple rows per state (one for Male, one for Female)
  # After: One row per state (Male and Female as separate columns)
  pivot_wider(names_from = gender, values_from = population) |>
  
  # Calculate additional metrics
  mutate(
    total = Male + Female,                    # Total population
    male_pct = Male / total * 100,           # Male percentage
    female_pct = Female / total * 100,       # Female percentage
    gender_ratio = Male / Female * 100       # Males per 100 females
  ) |>
  
  # Sort by total population (largest first)
  arrange(desc(total))

# Display top 5 states
state_gender |> 
  head(5)
# A tibble: 5 × 7
  state             Female   Male   total male_pct female_pct gender_ratio
  <chr>              <int>  <int>   <int>    <dbl>      <dbl>        <dbl>
1 Jonglei           624275 734327 1358602     54.1       45.9        118. 
2 Central Equatoria 521835 581722 1103557     52.7       47.3        111. 
3 Warrap            502194 470734  972928     48.4       51.6         93.7
4 Upper Nile        438923 525430  964353     54.5       45.5        120. 
5 Eastern Equatoria 440974 465187  906161     51.3       48.7        105. 
Note: Understanding pivot_wider()

pivot_wider() reshapes data from long to wide format:

Before (Long format):

State      Gender  Population
Jonglei    Male    734327
Jonglei    Female  624275
Warrap     Male    470734
Warrap     Female  502194

After (Wide format, plus the total column that mutate() adds):

State    Male    Female  Total
Jonglei  734327  624275  1358602
Warrap   470734  502194  972928

Why? Because it’s easier to calculate ratios and percentages when Male and Female are in separate columns!
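The reshape can be tried on its own with a tiny tibble. This sketch reuses the four Jonglei and Warrap values from the illustration above; the object names `long` and `wide` are chosen here for the example.

```r
library(dplyr)
library(tidyr)

# Long format: one row per state-gender combination
long <- tibble(
  state      = c("Jonglei", "Jonglei", "Warrap", "Warrap"),
  gender     = c("Male", "Female", "Male", "Female"),
  population = c(734327L, 624275L, 470734L, 502194L)
)

wide <- long |>
  pivot_wider(names_from = gender, values_from = population) |>
  mutate(total = Male + Female)  # total comes from mutate(), not from the pivot

wide

# pivot_longer() reverses the reshape when a plot needs long data again
wide |>
  pivot_longer(cols = c(Male, Female), names_to = "gender", values_to = "population")
```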

Step 18: Display State Gender Table

state_gender |>
  head(5) |>
  
  # Select and rename columns for display
  select(
    State = state,
    Male,
    Female,
    Total = total,
    `Male %` = male_pct,
    `Female %` = female_pct,
    `Gender Ratio` = gender_ratio
  ) |>
  
  # Create table
  gt(rowname_col = "State") |>
  cols_align(columns = State, align = "right") |> 
  # Add header
  tab_header(
    title = md("**Gender Distribution by State**"),
    subtitle = "Top 5 Most Populous States"
  ) |>
  
  # Format population columns with commas
  fmt_number(
    columns = c(Male, Female, Total),
    decimals = 0,
    use_seps = TRUE
  ) |>
  
  # Format percentage and ratio columns
  fmt_number(
    columns = c(`Male %`, `Female %`, `Gender Ratio`),
    decimals = 2
  ) |>
  
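  # Add color gradient to Gender Ratio (gold = low, white = middle, cyan = high)
  # With domain = c(90, 120), white falls at the midpoint, 105, not at 100
  # To center white on parity, use a domain symmetric around 100, e.g. c(80, 120)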
  data_color(
    columns = `Gender Ratio`,
    palette = c("#FFD700", "#ffffff", "#22d3ee"),
    domain = c(90, 120)
  ) |>
  
  # Highlight State column
  tab_style(
    style = cell_fill(color = "#f8f9fa"),
    locations = cells_body(columns = State)
  ) |>
  
  # Add footnote explaining Gender Ratio
  tab_footnote(
    footnote = "Gender Ratio represents males per 100 females. Values near 100 indicate balance.",
    locations = cells_column_labels(columns = `Gender Ratio`)
  ) |>
  
  # Apply pre-built theme
  gt_theme_538(quiet = TRUE) |>
  
  # Adjust font sizes
  tab_options(
    table.font.size = px(12),
    heading.title.font.size = px(16),
    footnotes.font.size = px(10)
  )
Gender Distribution by State
Top 5 Most Populous States

                    Male      Female    Total       Male %   Female %   Gender Ratio¹
Jonglei             734,327   624,275   1,358,602   54.05    45.95      117.63
Central Equatoria   581,722   521,835   1,103,557   52.71    47.29      111.48
Warrap              470,734   502,194     972,928   48.38    51.62       93.74
Upper Nile          525,430   438,923     964,353   54.49    45.51      119.71
Eastern Equatoria   465,187   440,974     906,161   51.34    48.66      105.49

¹ Gender Ratio represents males per 100 females. Values near 100 indicate balance.
Important: Understanding Gender Ratio

Gender Ratio = (Males / Females) × 100

  • Ratio = 100: Perfect balance (equal males and females)
  • Ratio > 100: More males than females
  • Ratio < 100: More females than males

For example:

  • Ratio of 105 means 105 males per 100 females (5% more males)
  • Ratio of 95 means 95 males per 100 females (5% fewer males)
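As a quick worked check, the formula can be applied by hand to two states from the table. The counts below are copied from the state table; the vector names are chosen here for the example.

```r
# Counts copied from the state table above
male   <- c(Jonglei = 734327, Warrap = 470734)
female <- c(Jonglei = 624275, Warrap = 502194)

gender_ratio <- male / female * 100  # males per 100 females
round(gender_ratio, 2)
# Jonglei comes out near 117.63 (more males); Warrap near 93.74 (more females)

# Percentage surplus (+) or deficit (-) of males relative to females
round(gender_ratio - 100, 2)
```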

Step 19: Visualize State Gender Distribution

state_gender |>
  head(5) |>
  
  # Convert from wide to long format for plotting
  # Need separate rows for Male and Female to create grouped bars
  pivot_longer(
    cols = c(Male, Female),
    names_to = "gender",
    values_to = "population"
  ) |>
  
  # Reorder states by total population for better visualization
  mutate(state = fct_reorder(state, total)) |>
  
  # Create plot
  ggplot(aes(x = state, y = population, fill = gender)) +
  
  # Grouped bar chart (bars side by side)
  geom_col(position = "dodge", alpha = 0.9, width = 0.7) +
  
  # Set colors
  scale_fill_manual(
    values = c(
      "Male" = "#22d3ee",
      "Female" = "#FFD700"
    )
  ) +
  
  # Format y-axis labels (show as "100K" instead of "100000")
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
  
  # Flip coordinates (horizontal bars are easier to read)
  coord_flip() +
  
  # Add labels
  labs(
    title = "**Population by State and Gender**",
    subtitle = "Top 5 Most Populous States | South Sudan 2008 Census",
    x = NULL,  # Remove x-axis label (it says "state" which is obvious)
    y = "Population",
    fill = "Gender"
  ) +
  
  # Customize theme
  theme(
    panel.grid.major.y = element_blank(),  # Remove horizontal grid lines
    panel.grid.major.x = element_line(color = "#e5e5e5"),
    legend.position = "top"
  )

Population by state and gender for the top 5 most populous states
Tip: Choosing the Right Chart Type

Grouped Bar Chart (what we used):
  • Best for: Comparing categories across groups
  • Shows: Exact values for each category
  • Advantage: Easy to compare Male vs Female within each state

Stacked Bar Chart (alternative):
  • Best for: Showing part-to-whole relationships
  • Shows: Total and composition
  • Advantage: Shows total population at a glance

Why coord_flip()? Long state names are easier to read horizontally than at an angle!
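Switching between the two chart types is a one-argument change. The sketch below uses only two states, with values copied from the state table, so it runs without the full dataset; `two_states` and `base_plot` are names made up for this example.

```r
library(ggplot2)

# Values copied from the state table above (two states only, for brevity)
two_states <- data.frame(
  state      = c("Jonglei", "Jonglei", "Warrap", "Warrap"),
  gender     = c("Male", "Female", "Male", "Female"),
  population = c(734327, 624275, 470734, 502194)
)

base_plot <- ggplot(two_states, aes(x = state, y = population, fill = gender)) +
  scale_fill_manual(values = c(Male = "#22d3ee", Female = "#FFD700")) +
  coord_flip()

grouped <- base_plot + geom_col(position = "dodge")  # bars side by side
stacked <- base_plot + geom_col(position = "stack")  # segments on top of each other

grouped
stacked
```

In the stacked version the full bar length is the state total, which makes totals easy to compare but the Male and Female segments harder to compare directly.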


Part 6: Age Category Analysis

Step 20: Calculate National Age Distribution

age_summary <- census_filtered |>
  
  # Group by age category
  group_by(age_category) |>
  
  # Sum population for each age group
  summarise(
    population = sum(population),
    .groups = "drop"
  ) |>
  
  # Calculate percentages
  mutate(
    percentage = population / sum(population) * 100,
    percentage_label = percent(percentage / 100, accuracy = 0.01)
  ) |>
  
  # Sort by population (largest first)
  arrange(desc(population))

# Display results
age_summary
# A tibble: 7 × 4
  age_category population percentage percentage_label
  <chr>             <int>      <dbl> <chr>           
1 0-14            3659337      44.3  44.30%          
2 15-24           1628835      19.7  19.72%          
3 25-34           1234926      14.9  14.95%          
4 35-44            815517       9.87 9.87%           
5 45-54            473365       5.73 5.73%           
6 55-64            237426       2.87 2.87%           
7 65+              211084       2.56 2.56%           
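
The percentages can be verified by hand from the printed counts. This sketch copies the seven counts from the output above into a named vector (`age_counts` is a name chosen for the example) and recomputes the shares.

```r
# Counts copied from the age_summary output above
age_counts <- c(
  "0-14" = 3659337, "15-24" = 1628835, "25-34" = 1234926, "35-44" = 815517,
  "45-54" = 473365, "55-64" = 237426, "65+" = 211084
)

total  <- sum(age_counts)           # 8,260,490, the same total as Male + Female
shares <- age_counts / total * 100

round(shares, 2)
sum(shares)                         # shares of a whole add up to 100 (within floating-point error)

# Cumulative share: what fraction of the population is below each upper age bound?
round(cumsum(shares), 2)
```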

Step 21: Display Age Distribution Table

age_summary |>
  select(
    `Age Category` = age_category, 
    Population = population,
    Percentage = percentage_label
  ) |>
  
  # Create table
  gt() |>
  
  # Add header
  tab_header(
    title = md("**Population Distribution by Age Category**"),
    subtitle = "National Summary"
  ) |>
  
  # Format population with commas
  fmt_number(
    columns = Population,
    decimals = 0,
    use_seps = TRUE
  ) |>
  
  # Add color gradient based on population size
  data_color(
    columns = Population,
    palette = c("#000000", "#0891b2", "#22d3ee", "#FFD700")
  ) |>
  
  # Make text white on colored backgrounds
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_body(columns = Population)
  ) |>
  
  # Add vertical divider between columns
  gt_add_divider(columns = `Age Category`, color = "#e5e5e5") |>
  
  # Adjust font sizes
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16)
  )
Population Distribution by Age Category
National Summary

Age Category   Population   Percentage
0-14           3,659,337    44.30%
15-24          1,628,835    19.72%
25-34          1,234,926    14.95%
35-44            815,517     9.87%
45-54            473,365     5.73%
55-64            237,426     2.87%
65+              211,084     2.56%

Step 22: Visualize Age Distribution

age_summary |>
  
  # Reorder age categories by population for better visual ranking
  mutate(age_category = fct_reorder(age_category, population)) |>
  
  # Create plot
  ggplot(aes(x = age_category, y = population, fill = population)) +
  
  # Bar chart
  geom_col(alpha = 0.9, show.legend = FALSE) +
  
  # Add text labels showing exact population
  geom_text(
    aes(label = comma(population)),
    hjust = -0.1,  # Position slightly outside the bar
    size = 3.5,
    fontface = "bold",
    color = "#06b6d4"
  ) +
  
  # Color gradient from dark to light
  scale_fill_gradient(low = "#000000", high = "#FFD700") +
  
  # Format y-axis and add space for text labels
  scale_y_continuous(
    labels = label_number(scale = 1e-3, suffix = "K"),
    expand = expansion(mult = c(0, 0.15))  # Add 15% space on right for labels
  ) +
  
  # Horizontal bars
  coord_flip() +
  
  # Labels
  labs(
    title = "**Population Distribution by Age Category**",
    subtitle = "South Sudan 2008 Census | National Summary",
    x = NULL,
    y = "Population"
  ) +
  
  # Theme adjustments
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "#e5e5e5")
  )

Population distribution across age categories with exact counts labeled
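Note: Understanding scale_y_continuous()

expansion(mult = c(0, 0.15)) controls the padding at each end of the y scale. Because coord_flip() draws the y scale horizontally, the two ends appear on the left and right:

  • First value (0): No extra space at the lower end (left)
  • Second value (0.15): Add 15% extra space at the upper end (right)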

Why? To make room for our text labels showing exact population counts!

Without this, the labels would get cut off at the edge of the plot.
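
To compare the two settings side by side, here is a small sketch on invented data. The `toy` data frame and the `make_plot()` helper are hypothetical names used only for this illustration.

```r
library(ggplot2)

# Hypothetical toy data (numbers invented for illustration)
toy <- data.frame(category = c("A", "B", "C"), value = c(120, 80, 45))

# Helper that builds the same chart with a chosen amount of upper padding
make_plot <- function(upper_mult) {
  ggplot(toy, aes(x = category, y = value)) +
    geom_col() +
    geom_text(aes(label = value), hjust = -0.1) +  # labels sit just past the bar end
    scale_y_continuous(expand = expansion(mult = c(0, upper_mult))) +
    coord_flip()
}

tight  <- make_plot(0)     # no padding: the longest bar's label has no room
padded <- make_plot(0.15)  # 15% padding at the upper end leaves room for labels

tight
padded
```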


dplyr 1.2.0 & stringr 1.6.0: Migration Quick Reference

This tutorial showcased several new functions introduced in dplyr 1.2.0 and stringr 1.6.0. Here is a consolidated before/after reference for migrating your existing code:

Task                    Before                              After (2026)
Drop rows               filter(x != "Total")                filter_out(x == "Total")
Multi-column drop       filter(a != "X", b != "Y")          filter_out(when_any(a == "X", b == "Y"))
Recode all values       case_match(x, "a" ~ 1, "b" ~ 2)     recode_values(x, "a" ~ 1, "b" ~ 2)
Recode from lookup      age_map[x] (named vector)           recode_values(x, from = tbl$from, to = tbl$to)
Replace few values      if_else(x == "old", "new", x)       replace_values(x, "old" ~ "new")
Conditional replace     if_else(x > 5, 0, x)                replace_when(x, x > 5 ~ 0)
To snake_case           snakecase::to_snake_case(x)         str_to_snake(x)
To camelCase            manual regex                        str_to_camel(x)
To kebab-case           gsub(" ", "-", tolower(x))          str_to_kebab(x)
Case-insensitive LIKE   str_detect(x, regex("pat", TRUE))   str_ilike(x, "%pat%")
Note: Deprecation Timeline
  • recode() — superseded since dplyr 1.1.0; migrate to recode_values() or replace_values()
  • case_match() — soft-deprecated in dplyr 1.2.0; migrate to recode_values()
  • str_like(ignore_case = TRUE) — deprecated in stringr 1.6.0; use str_ilike() instead

These old functions continue to work but will emit deprecation warnings. New code should use the replacements above.
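
If you are unsure which dplyr version is installed, a guarded check lets the same script run either way. The sketch below is an assumption-laden illustration: it uses invented data with no missing values, and it only calls filter_out() when the installed dplyr actually exports it.

```r
library(dplyr)

# Hypothetical toy data with no missing values (numbers invented)
toy <- tibble(
  age_group  = c("0-14", "15-24", "Total"),
  population = c(10L, 20L, 30L)
)

# Older idiom: keep the rows that are NOT "Total"
kept_old <- toy |> filter(age_group != "Total")
kept_old

# Newer idiom, run only if the installed dplyr exports filter_out()
if ("filter_out" %in% getNamespaceExports("dplyr")) {
  kept_new <- toy |> filter_out(age_group == "Total")
  print(identical(kept_old, kept_new))
} else {
  message(
    "filter_out() is not available; installed dplyr is ",
    as.character(packageVersion("dplyr"))
  )
}
```

The comparison here deliberately avoids missing values; check the dplyr documentation for how each function treats NA before swapping one for the other on real data.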


Key Insights

Important: What the Data Tells Us

1. Population Concentration

The top five states (Jonglei, Central Equatoria, Warrap, Upper Nile, and Eastern Equatoria) together hold about 5.3 million of the 8.26 million people in this dataset, roughly 64% of the national population. This geographic concentration should inform resource allocation decisions.

2. Youth Demographics

The age distribution reveals a young population, typical of developing nations: 44.3% are under 15 and about 64% are under 25. This “youth bulge” represents both:

  • Opportunity: Large workforce potential
  • Challenge: Need for education and employment infrastructure

3. Gender Balance

Nationally the split is 51.9% male to 48.1% female. Among the five most populous states, the gender ratio ranges from about 94 to 120 males per 100 females, variation that may reflect:

  • Migration patterns
  • Conflict impacts
  • Data collection methodology

4. Regional Disparities

Substantial population differences between states suggest the need for:

  • Differentiated development strategies
  • Targeted resource allocation
  • Context-specific policy interventions


Conclusion

Congratulations! You’ve completed a comprehensive demographic analysis using R and the tidyverse.

Tip: What You’ve Learned

Data Skills:
✅ Loading data from URLs with read_csv()
✅ Cleaning data with tidyverse functions
✅ String splitting and parsing with str_split_i(), str_remove(), and regex
✅ Extracting information from structured text (gender from “Population - Male (Number)”)
✅ Recategorizing data with case_when() and recode_values() for age groups
✅ Dropping rows with filter_out() (dplyr 1.2.0)
✅ Grouping and summarizing with group_by() and summarise()
✅ Reshaping data with pivot_wider() and pivot_longer()
✅ Calculating percentages and ratios

Visualization Skills:
✅ Creating pie charts and bar charts
✅ Customizing colors and themes
✅ Adding informative labels and titles
✅ Using coord_flip() for horizontal layouts
✅ Understanding Grammar of Graphics principles

Table Skills:
✅ Building professional tables with gt
✅ Formatting numbers and percentages
✅ Adding colors and styling
✅ Creating informative footnotes

String Processing Skills:
✅ Multiple methods for text extraction (split, regex, remove, separate)
✅ Using separate_wider_delim() to split into multiple columns
✅ Using str_split_i() to extract specific pieces
✅ Case conversion with str_to_camel(), str_to_snake(), str_to_kebab() (stringr 1.6.0)
✅ Understanding when to use each method
✅ Regular expressions for pattern matching

Workflow Skills:
✅ Using the pipe operator |> for readable code
✅ Writing clear, commented code
✅ Creating reproducible analyses
✅ Structuring code in logical steps

Note: Next Steps for Learning

Beginner:
  1. Practice with different datasets
  2. Try modifying the colors and themes
  3. Experiment with different chart types

Intermediate:
  4. Learn about purrr for functional programming
  5. Explore stringr for text manipulation
  6. Study lubridate for date handling

Advanced:
  7. Create interactive dashboards with Shiny
  8. Build custom functions and packages
  9. Contribute to open-source R projects

Resources:
  • R for Data Science - Free online book
  • RStudio Cheatsheets - Quick references
  • TidyTuesday - Weekly practice datasets



Alier Reng

Founder, Lead Educator & Creative Director at PyStatR+

Alier Reng is a Data Scientist, Educator, and Founder of PyStatR+, a platform advancing open and practical data science education. His work blends analytics, philosophy, and storytelling to make complex ideas human and empowering. Knowledge is freedom. Data is truth's language — ethics and transparency, its grammar.


Editor’s Note

This tutorial reflects a deliberate editorial balance between accessibility and technical depth. While R offers many approaches to data manipulation, this guide emphasizes the tidyverse philosophy—particularly dplyr 1.2.0 for data transformation and stringr 1.6.0 for text processing—because these tools prioritize readability and consistency.

This edition highlights dplyr 1.2.0’s filter_out() for clearer row-dropping semantics and recode_values() with lookup tables for maintainable value mappings. We also introduce stringr 1.6.0’s case conversion trio—str_to_camel(), str_to_snake(), and str_to_kebab()—for seamless interoperability with Python, JavaScript, and API naming conventions.

This approach aligns with the PyStatR+ Charter by emphasizing clarity, honesty, and accessibility without unnecessary complexity.


Acknowledgements

This lesson is part of the broader PyStatR+ Learning Platform, developed with gratitude to mentors, learners, and the open-source community that continually advances the R ecosystem. Special thanks to Hadley Wickham, the tidyverse team, and the contributors who make tools like dplyr, stringr, and ggplot2 possible.

