Demographic Patterns in South Sudan: A Tidyverse Exploration

A Beginner’s Guide to Data Analysis with R and the Tidyverse

Tidyverse

ggplot2

Demographics

Data Visualization

Learn data analysis step-by-step using R’s tidyverse!

Author

Alierwai Reng

Published

November 20, 2024

Introduction

Learn data analysis step-by-step using R’s tidyverse! This beginner-friendly tutorial explores South Sudan’s demographics with clear explanations, beautiful visualizations, and professional tables.

Welcome to this hands-on data analysis tutorial! This guide showcases dplyr 1.2.0 for powerful data manipulation—including the new filter_out() and recode_values() functions—and introduces key stringr 1.6.0 functions for cleaning and transforming text data, including the new case conversion trio: str_to_camel(), str_to_snake(), and str_to_kebab().

By the end of this guide, you’ll understand how to:

Load and explore real-world census data
Clean and transform data using tidyverse functions
Calculate summary statistics and group-level metrics
Create beautiful visualizations with ggplot2
Build professional tables with gt

We will analyze South Sudan’s 2008 census data as a practical case study; however, the analytical techniques and workflows you will learn are fully transferable to any dataset across domains and contexts.

The data were obtained from the National Bureau of Statistics, South Sudan, via the Open Data for Africa platform: Population by Age and Sex (2008) — http://southsudan.opendataforafrica.org/fvjqdpe/population-by-age-and-sex-2008-south-sudan

What is the Tidyverse?

The tidyverse is a collection of R packages designed to work together seamlessly for data science. Think of it as a complete toolkit where all the tools fit perfectly in your hand. Key packages include:

dplyr - data manipulation (filter, mutate, select, group, summarize); the focus of this tutorial
stringr - string manipulation (clean, extract, transform text); introduced throughout this tutorial
ggplot2 - data visualization (charts and graphs)
tidyr - data tidying (reshape and clean)
readr - data import (read CSV and other delimited text files)

These packages share a common design philosophy, making your code readable and your workflow intuitive!

Tested With

R 4.4.x, dplyr 1.2.0, stringr 1.6.0, ggplot2 3.5.x, gt 0.11.x.

What’s New in dplyr 1.2.0 & stringr 1.6.0

dplyr 1.2.0 (released February 2026) introduces powerful new tools:

filter_out() — the missing complement to filter(). Drop rows instead of keeping them, with cleaner boolean logic.
recode_values() — create entirely new columns by mapping old values to new values. Replaces case_match() with a cleaner formula or from/to interface.
replace_values() — partially update an existing column while preserving its type.
replace_when() — conditionally replace rows within columns, a type-stable alternative to if_else().
when_any() and when_all() — elementwise OR/AND helpers for multi-column conditions.

stringr 1.6.0 (released November 2025) adds:

str_to_camel(), str_to_snake(), str_to_kebab() — convert between programming case conventions.
str_ilike() — case-insensitive SQL-like pattern matching.

We’ll showcase several of these throughout this tutorial!

Part 1: Environment Setup

Step 1: Load Required Packages

Every R analysis starts by loading the libraries (packages) we need. Think of packages as apps on your phone—each one adds specific capabilities.

# Core tidyverse packages (loads dplyr, ggplot2, tidyr, readr, and more!)
library(tidyverse)      
library(janitor)

# Table formatting packages
library(gt)             # Grammar of Tables - for beautiful tables
library(gtExtras)       # Extra features for gt tables

# Visualization enhancement packages
library(ggtext)         # Rich text formatting in ggplot2
library(scales)         # Scale functions for axes and labels
library(glue)           # Easy string interpolation

# Confirmation message
cat("✅ All packages loaded successfully!\n")

✅ All packages loaded successfully!

cat("📦 Tidyverse version:", as.character(packageVersion("tidyverse")), "\n")

📦 Tidyverse version: 2.0.0

Package Installation

If needed, install with: install.packages(c("tidyverse", "janitor", "gt", "gtExtras", "ggtext", "scales", "glue"))
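If only some of these packages are missing, a quick check avoids reinstalling everything. The sketch below is one possible approach using base R only; the helper name find_missing_packages() is made up for illustration, and the install call is left commented out so nothing is installed by accident.

```r
# Packages this tutorial loads in Step 1
required_packages <- c(
  "tidyverse", "janitor", "gt", "gtExtras", "ggtext", "scales", "glue"
)

# Hypothetical helper: return the packages that are not installed yet
find_missing_packages <- function(packages) {
  installed <- rownames(installed.packages())
  setdiff(packages, installed)
}

missing_packages <- find_missing_packages(required_packages)

if (length(missing_packages) == 0) {
  cat("All required packages are already installed.\n")
} else {
  cat("Missing packages:", paste(missing_packages, collapse = ", "), "\n")
  # Uncomment to install only what is missing:
  # install.packages(missing_packages)
}
```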

Step 2: Configure Visualization Theme

theme_set(
  theme_minimal(base_size = 13, base_family = "sans") +
    theme(
      plot.title = element_markdown(size = 16, face = "bold", color = "#06b6d4", margin = margin(b = 10)),
      plot.subtitle = element_markdown(size = 12, color = "#666666", margin = margin(b = 15)),
      plot.caption = element_markdown(size = 9, color = "#999999", hjust = 0),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "#e5e5e5", linewidth = 0.3),
      legend.position = "top",
      legend.title = element_text(face = "bold", size = 11),
      axis.title = element_text(face = "bold", size = 11)
    )
)

cat("🎨 Custom theme configured!\n")

🎨 Custom theme configured!

Part 2: Loading and Exploring Data

Step 3: Load Census Data from URL

url <- "https://raw.githubusercontent.com/tongakuot/r_tutorials/main/06-data-wrangling/00-input/ss_2008_census_data_raw.csv"

census_raw <- read_csv(url, show_col_types = FALSE)

cat("✅ Data loaded successfully!\n")

✅ Data loaded successfully!

cat("📋 Dimensions:", nrow(census_raw), "rows ×", ncol(census_raw), "columns\n")

📋 Dimensions: 453 rows × 10 columns

Step 4: Examine the Data Structure

glimpse(census_raw)

Rows: 453
Columns: 10
$ Region              <chr> "KN.A2", "KN.A2", "KN.A2", "KN.A2", "KN.A2", "KN.A…
$ `Region Name`       <chr> "Upper Nile", "Upper Nile", "Upper Nile", "Upper N…
$ `Region - RegionId` <chr> "SS-NU", "SS-NU", "SS-NU", "SS-NU", "SS-NU", "SS-N…
$ Variable            <chr> "KN.B2", "KN.B2", "KN.B2", "KN.B2", "KN.B2", "KN.B…
$ `Variable Name`     <chr> "Population, Total (Number)", "Population, Total (…
$ Age                 <chr> "KN.C1", "KN.C2", "KN.C3", "KN.C4", "KN.C5", "KN.C…
$ `Age Name`          <chr> "Total", "0 to 4", "5 to 9", "10 to 14", "15 to 19…
$ Scale               <chr> "units", "units", "units", "units", "units", "unit…
$ Units               <chr> "Persons", "Persons", "Persons", "Persons", "Perso…
$ `2008`              <dbl> 964353, 150872, 151467, 126140, 103804, 82588, 767…

Step 5: Preview the Data

# head() shows the first n rows
census_raw |>
  head(10)

# A tibble: 10 × 10
   Region `Region Name` `Region - RegionId` Variable `Variable Name`       Age  
   <chr>  <chr>         <chr>               <chr>    <chr>                 <chr>
 1 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C1
 2 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C2
 3 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C3
 4 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C4
 5 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C5
 6 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C6
 7 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C7
 8 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C8
 9 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C9
10 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C…
# ℹ 4 more variables: `Age Name` <chr>, Scale <chr>, Units <chr>, `2008` <dbl>

Part 3: Data Cleaning

Raw data is rarely analysis-ready. We need to clean and standardize it first!

Introducing stringr for Text Cleaning

The stringr package (part of tidyverse) provides consistent, intuitive functions for string manipulation. In this section, we’ll use several key functions:

Function	Purpose	Example
`str_to_lower()`	Convert to lowercase	“HELLO” → “hello”
`str_to_upper()`	Convert to UPPERCASE	“hello” → “HELLO”
`str_to_title()`	Convert to Title Case	“hello world” → “Hello World”
`str_to_sentence()`	Sentence case	“hello world” → “Hello world”
`str_to_camel()`	Convert to camelCase	“quick brown fox” → “quickBrownFox”
`str_to_snake()`	Convert to snake_case	“Quick Brown Fox” → “quick_brown_fox”
`str_to_kebab()`	Convert to kebab-case	“Quick Brown Fox” → “quick-brown-fox”
`str_squish()`	Trim and collapse extra whitespace	“  hello   world  ” → “hello world”
`str_replace_all()`	Replace patterns	“a b c” → “a_b_c”

The first seven are case conversion functions. The trio str_to_camel(), str_to_snake(), and str_to_kebab() are new in stringr 1.6.0 and convert between programming naming conventions—essential when bridging R data with Python, JavaScript, or API outputs.
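To see the long-standing functions from the table in action, here is a small sketch on a made-up string. It deliberately sticks to functions available in older stringr releases, so it should run even if stringr 1.6.0 is not installed yet.

```r
library(stringr)

# A made-up messy string with extra spaces and mixed case
messy <- "  south   SUDAN census  "

# Trim both ends and collapse repeated spaces
tidy <- str_squish(messy)                  # "south SUDAN census"

# Case conversions
lower    <- str_to_lower(tidy)             # "south sudan census"
upper    <- str_to_upper(tidy)             # "SOUTH SUDAN CENSUS"
title    <- str_to_title(tidy)             # "South Sudan Census"
sentence <- str_to_sentence(tidy)          # "South sudan census"

# Replace every space with an underscore
underscored <- str_replace_all(lower, " ", "_")   # "south_sudan_census"

print(c(tidy, lower, upper, title, sentence, underscored))
```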

stringr 1.6.0: Case Conversion in Action

Before stringr 1.6.0, converting between camelCase, snake_case, and kebab-case required manual regex or external packages like snakecase. Now these conversions are built in:

# New in stringr 1.6.0: convert between programming case conventions
demo_text <- "south sudan census data"

cat("Original:    ", demo_text, "\n")
cat("camelCase:   ", str_to_camel(demo_text), "\n")
cat("PascalCase:  ", str_to_camel(demo_text, first_upper = TRUE), "\n")
cat("snake_case:  ", str_to_snake(demo_text), "\n")
cat("kebab-case:  ", str_to_kebab(demo_text), "\n")
cat("Title Case:  ", str_to_title(demo_text), "\n")

Transition: Old → New Case Conversion

Before stringr 1.6.0 — manual regex or external package:

# Required snakecase package or manual work
snakecase::to_snake_case("South Sudan Census")
gsub(" ", "-", tolower("South Sudan Census"))

After stringr 1.6.0 — native, consistent, pipe-friendly:

# Built into stringr — works seamlessly in tidyverse pipelines
"South Sudan Census" |> str_to_snake()   # "south_sudan_census"
"South Sudan Census" |> str_to_kebab()   # "south-sudan-census"
"South Sudan Census" |> str_to_camel()   # "southSudanCensus"

These are particularly valuable when bridging R data with Python (snake_case), JavaScript (camelCase), or URL slugs (kebab-case).

Step 6: Clean and Transform the Dataset

Here’s a practical cleaning pipeline using the janitor package for automatic name standardization:

census_clean <-
  census_raw |>

  # Standardize column names automatically (e.g. `Region Name` -> region_name, `2008` -> x2008)
  clean_names() |>

  # Keep only the columns we need and give them meaningful names
  select(
    state = region_name,
    gender = variable_name,
    age_category = age_name,
    population = x2008
  ) |>

  # Clean text columns and convert population to integer
  mutate(
    across(where(is.character), \(x) str_squish(x) |> str_to_title()),
    population = as.integer(population)
  ) |>

  # Remove rows with missing or invalid data
  filter(!is.na(population), population > 0)

# Display results
cat("✅ Data cleaning complete!\n")

✅ Data cleaning complete!

cat("📊 Cleaned dataset:", nrow(census_clean), "rows ×", ncol(census_clean), "columns\n")

📊 Cleaned dataset: 450 rows × 4 columns

The Cleaning Pipeline Explained

Key steps:

clean_names(): Automatically standardizes column names (lowercase, underscores, safe for R)
select(): Chooses columns and renames them in one step
across(where(is.character), ...): Applies the same transformation to every text column at once
str_squish() + str_to_title(): Removes extra whitespace, then converts the text to Title Case
as.integer(): Converts population to whole numbers (appropriate for count data)
filter(): Keeps only rows with a non-missing, positive population

A single pipeline handles all of these steps, which keeps the code both efficient and readable.
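To watch each step do its job, it can help to run the same pipeline on a tiny table first. The sketch below uses a made-up three-row tibble (toy_raw is not part of the census data) with one messy name, one missing value, and one zero.

```r
library(dplyr)
library(stringr)
library(tibble)
library(janitor)

# Made-up data that mimics the raw column names
toy_raw <- tibble(
  `Region Name`   = c("  upper   nile ", "JONGLEI", "unity"),
  `Variable Name` = c("Population, Male (Number)",
                      "Population, Female (Number)",
                      "Population, Total (Number)"),
  `2008`          = c(100, NA, 0)
)

toy_clean <- toy_raw |>
  clean_names() |>                       # region_name, variable_name, x2008
  select(
    state = region_name,
    gender = variable_name,
    population = x2008
  ) |>
  mutate(
    across(where(is.character), \(x) str_squish(x) |> str_to_title()),
    population = as.integer(population)
  ) |>
  filter(!is.na(population), population > 0)   # drops the NA row and the 0 row

toy_clean
```

Only the first row survives: its state is tidied to "Upper Nile", while the rows with a missing or zero population are filtered out.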

Step 7: Preview the Cleaned Data

census_clean |>
  head(10)

# A tibble: 10 × 4
   state      gender                     age_category population
   <chr>      <chr>                      <chr>             <int>
 1 Upper Nile Population, Total (Number) Total            964353
 2 Upper Nile Population, Total (Number) 0 To 4           150872
 3 Upper Nile Population, Total (Number) 5 To 9           151467
 4 Upper Nile Population, Total (Number) 10 To 14         126140
 5 Upper Nile Population, Total (Number) 15 To 19         103804
 6 Upper Nile Population, Total (Number) 20 To 24          82588
 7 Upper Nile Population, Total (Number) 25 To 29          76754
 8 Upper Nile Population, Total (Number) 30 To 34          63134
 9 Upper Nile Population, Total (Number) 35 To 39          56806
10 Upper Nile Population, Total (Number) 40 To 44          42139

Part 3B: Advanced String Processing

Now we need to extract more meaningful information from our data. The gender column actually contains structured text like “Population, Male (Number)” that we can parse!

Step 8: Examine the Gender Column Structure

# See unique values in gender column
cat("🔍 Unique values in gender column:\n")

🔍 Unique values in gender column:

census_clean |>
  distinct(gender) |>
  pull(gender)

[1] "Population, Total (Number)"  "Population, Male (Number)"  
[3] "Population, Female (Number)"

The gender column follows a pattern, “Population, Gender (Type)”, from which we need to extract just the gender value.

Step 9: Extract Gender Information

The gender column contains structured text like “Population, Male (Number)”. We’ll extract just the gender value using the most practical approaches.

Method 1: Using str_split_i() — Simple and Direct

Split the text and extract the piece you need:

# Apply gender extraction to our dataset
census_parsed <- census_clean |>
  mutate(
    # Split on spaces and keep the 2nd piece (e.g. "Male"); str_remove() and
    # str_squish() then strip any leftover "(...)" text or stray whitespace
    gender = str_split_i(gender, " ", 2) |>
             str_remove(" \\(.*\\)") |>
             str_squish()
  )

# Verify extraction worked
cat("✅ Gender extraction complete!\n")

✅ Gender extraction complete!

cat("🎯 Unique gender values:\n")

🎯 Unique gender values:

census_parsed |>
  distinct(gender) |>
  pull(gender)

[1] "Total"  "Male"   "Female"

How This Works

str_split_i(gender, " ", 2): Split on spaces, extract the 2nd piece
- “Population, Male (Number)” → “Male”
str_remove(" \\(.*\\)"): Remove any trailing “ (anything)” pattern
- A safeguard: “Male” has nothing to remove here, but a value such as “Male (Number)” would become “Male”
str_squish(): Clean extra whitespace
- Another safeguard: “ Male ” → “Male”

For pulling a single piece out of delimited text, this is the most direct approach.
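To check which piece lands where, you can split one label by hand. This is a minimal sketch on a single value copied from the gender column; the variable names are only for illustration.

```r
library(stringr)

label <- "Population, Male (Number)"

# Splitting on a single space yields three pieces
pieces <- str_split_1(label, " ")              # "Population," "Male" "(Number)"

# Pick pieces by position; negative indices count from the end
first_piece  <- str_split_i(label, " ", 1)     # "Population,"
gender_piece <- str_split_i(label, " ", 2)     # "Male"
last_piece   <- str_split_i(label, " ", -1)    # "(Number)"

print(pieces)
print(gender_piece)
```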

Method 2: Using separate_wider_delim() — When You Need Multiple Pieces

If you need to keep multiple pieces from a split, use separate_wider_delim():

# Demonstrate separate_wider_delim()
demo_separate <- census_clean |>
  select(original_gender = gender) |>
  separate_wider_delim(
    cols = original_gender,
    delim = " ",
    names = c("prefix", "gender_raw", "suffix"),
    too_few = "align_start",
    cols_remove = FALSE
  ) |>
  mutate(gender_clean = str_remove(gender_raw, " \\(.*\\)")) |>
  select(original_gender, prefix, gender_raw, gender_clean) |>
  distinct() |>
  head(3)

cat("✅ separate_wider_delim() keeps all pieces:\n")

✅ separate_wider_delim() keeps all pieces:

demo_separate

# A tibble: 3 × 4
  original_gender             prefix      gender_raw gender_clean
  <chr>                       <chr>       <chr>      <chr>       
1 Population, Total (Number)  Population, Total      Total       
2 Population, Male (Number)   Population, Male       Male        
3 Population, Female (Number) Population, Female     Female

When to Use Each Method

str_split_i() (Method 1):

You only need one piece
More concise code
Recommended for this task

separate_wider_delim() (Method 2):

You need multiple pieces as separate columns
Better for data reshaping workflows
Excellent when working with structured delimited text

For our case: We use Method 1 because we only extract the gender value.

Method 3: Regular Expressions — For Complex Patterns

For more complex text patterns, use regex with str_extract():

# Regex approach (for reference/learning)
demo_regex <- census_clean |>
  select(original_gender = gender) |>
  mutate(
    # Regex: text between " " and " ("
    gender_regex = str_extract(original_gender, "(?<= ).*(?= \\()")
  ) |>
  distinct() |>
  head(3)

cat("✅ Regex extraction with lookaround assertions:\n")

✅ Regex extraction with lookaround assertions:

demo_regex

# A tibble: 3 × 2
  original_gender             gender_regex
  <chr>                       <chr>       
1 Population, Total (Number)  Total       
2 Population, Male (Number)   Male        
3 Population, Female (Number) Female

Regex Lookaround Basics

"(?<= ).*(?= \\()" extracts the text between two patterns:

(?<= ): Lookbehind, so the match must be preceded by a space
.*: Match any characters
(?= \\(): Lookahead, so the match must be followed by “ (”

Result: Extracts “Male” from “Population, Male (Number)”
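If lookarounds feel cryptic, a capture group is a common alternative that gives the same result here. The sketch below compares the two on the three labels from the gender column; str_match() is swapped in as an illustration and is not the approach used in the pipeline above.

```r
library(stringr)

labels <- c(
  "Population, Total (Number)",
  "Population, Male (Number)",
  "Population, Female (Number)"
)

# Lookaround version: match only the text between a space and " ("
via_lookaround <- str_extract(labels, "(?<= ).*(?= \\()")

# Capture-group version: str_match() returns a matrix where column 1 is the
# full match and column 2 is the first parenthesised group
via_capture <- str_match(labels, ", (.*) \\(")[, 2]

print(via_lookaround)
print(identical(via_lookaround, via_capture))
```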

Additional approaches (such as separate + unite) are covered in the stringr and tidyr documentation.

Step 10: Recategorize Age Groups

Let’s group the fine-grained 5-year age bands into broader, more interpretable categories:

cat("🔍 Current age categories:\n")

🔍 Current age categories:

census_parsed |>
  distinct(age_category) |>
  arrange(age_category) |>
  pull(age_category)

 [1] "0 To 4"   "10 To 14" "15 To 19" "20 To 24" "25 To 29" "30 To 34"
 [7] "35 To 39" "40 To 44" "45 To 49" "5 To 9"   "50 To 54" "55 To 59"
[13] "60 To 64" "65+"      "Total"

Method 1: Using case_when() for Conditional Recategorization

The case_when() function is perfect for complex, multi-condition transformations:

census_final <- census_parsed |>
  mutate(
    age_category = str_to_lower(age_category),
    age_category = case_when(
      # Children (0-14)
      age_category %in% c("0 to 4", "5 to 9", "10 to 14") ~ "0-14",
      
      # Youth (15-24)
      age_category %in% c("15 to 19", "20 to 24") ~ "15-24",
      
      # Early working age (25-34)
      age_category %in% c("25 to 29", "30 to 34") ~ "25-34",
      
      # Middle working age (35-44)
      age_category %in% c("35 to 39", "40 to 44") ~ "35-44",
      
      # Later working age (45-54)
      age_category %in% c("45 to 49", "50 to 54") ~ "45-54",
      
      # Pre-retirement (55-64)
      age_category %in% c("55 to 59", "60 to 64") ~ "55-64",
      
      # Retirement age (65+)
      age_category == "65+" ~ "65+",
      
      # Catch any unexpected values
      TRUE ~ age_category
    )
  )

# Verify recategorization
cat("✅ Age categories recategorized!\n")

✅ Age categories recategorized!

cat("🎯 New age categories:\n")

🎯 New age categories:

census_final |> 
  distinct(age_category) |> 
  arrange(age_category) |>
  pull(age_category)

[1] "0-14"  "15-24" "25-34" "35-44" "45-54" "55-64" "65+"   "total"

Understanding case_when()

case_when() is like a multi-way IF statement (similar to SQL’s CASE WHEN):

Structure:

case_when(
  condition1 ~ result1,  # If condition1 is TRUE, return result1
  condition2 ~ result2,  # Else if condition2 is TRUE, return result2
  condition3 ~ result3,  # Else if condition3 is TRUE, return result3
  TRUE ~ default         # Else return default (catch-all)
)

Key points:

Conditions are evaluated in order (first match wins)
%in% checks whether a value is in a vector (like “is one of”)
~ separates the condition from the result
TRUE ~ at the end catches anything not matched above

Example for our data:

If age is “0 to 4” OR “5 to 9” OR “10 to 14” → return “0-14”
Else if age is “15 to 19” OR “20 to 24” → return “15-24”
And so on…
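The “first match wins” rule is easiest to see with numbers. The sketch below uses a made-up vector of ages (not the census data) and runs the same conditions in a sensible order and in a careless one.

```r
library(dplyr)

# Made-up ages, including a missing value
ages <- c(3, 17, 40, 70, NA)

# Conditions ordered from narrowest to widest: each age lands in the right band
good_order <- case_when(
  ages < 15  ~ "0-14",
  ages < 25  ~ "15-24",
  ages < 65  ~ "25-64",
  ages >= 65 ~ "65+",
  TRUE       ~ "unknown"      # catch-all, also picks up the NA
)

# Widest condition first: it swallows everything below 65,
# so the "0-14" branch can never be reached
bad_order <- case_when(
  ages < 65 ~ "under 65",
  ages < 15 ~ "0-14",
  TRUE      ~ "65+ or unknown"
)

print(good_order)
print(bad_order)
```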

Method 2: Using recode_values() — New in dplyr 1.2.0

For value-to-value mappings, the new recode_values() is cleaner than case_when(). It accepts a from/to lookup table or formula syntax:

# Build a lookup table — clean and portable
age_lookup <- tribble(
  ~from,       ~to,
  "0 to 4",    "0-14",
  "5 to 9",    "0-14",
  "10 to 14",  "0-14",
  "15 to 19",  "15-24",
  "20 to 24",  "15-24",
  "25 to 29",  "25-34",
  "30 to 34",  "25-34",
  "35 to 39",  "35-44",
  "40 to 44",  "35-44",
  "45 to 49",  "45-54",
  "50 to 54",  "45-54",
  "55 to 59",  "55-64",
  "60 to 64",  "55-64",
  "65+",       "65+"
)

# Demonstrate recode_values() (not applied to dataset)
demo_recode <- census_parsed |>
  select(age_category) |>
  mutate(
    age_category = str_to_lower(age_category),
    age_recoded = recode_values(
      age_category,
      from = age_lookup$from,
      to = age_lookup$to
    )
  ) |>
  distinct()

cat("✅ recode_values() demonstration:\n")

✅ recode_values() demonstration:

demo_recode

# A tibble: 15 × 2
   age_category age_recoded
   <chr>        <chr>      
 1 total        <NA>       
 2 0 to 4       0-14       
 3 5 to 9       0-14       
 4 10 to 14     0-14       
 5 15 to 19     15-24      
 6 20 to 24     15-24      
 7 25 to 29     25-34      
 8 30 to 34     25-34      
 9 35 to 39     35-44      
10 40 to 44     35-44      
11 45 to 49     45-54      
12 50 to 54     45-54      
13 55 to 59     55-64      
14 60 to 64     55-64      
15 65+          65+

Comparing case_when() vs recode_values()

Use case_when() when:

Multiple conditions per category (AND/OR logic)
Complex conditional expressions
Best when conditions aren’t simple equality checks

Use recode_values() (dplyr 1.2.0) when:

Simple value-to-value mappings
Lookup table stored externally (CSV, tribble)
Cleaner syntax for 1-to-1 replacements
Replaces the superseded case_match() and recode()

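Both work for our use case, with one visible difference: in the demonstration above, recode_values() returned NA for the unmatched “total”, whereas case_when() kept it through the TRUE ~ age_category fallback. We use case_when() above because it groups related age bands together, making the logic visible. For production code with many mappings, recode_values() with a lookup table is more maintainable.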
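If you are on an earlier dplyr release, a join against the lookup table gives a similar result. This is a sketch of that alternative, not the tutorial's method; it uses a shortened, illustrative lookup table and assumes dplyr 1.1.0 or later for join_by().

```r
library(dplyr)
library(tibble)

# Shortened lookup table for illustration
age_lookup_small <- tribble(
  ~from,       ~to,
  "0 to 4",    "0-14",
  "5 to 9",    "0-14",
  "15 to 19",  "15-24",
  "65+",       "65+"
)

ages <- tibble(age_category = c("0 to 4", "15 to 19", "65+", "total"))

recoded <- ages |>
  # Unmatched values (here "total") get NA in the `to` column
  left_join(age_lookup_small, by = join_by(age_category == from)) |>
  # Fall back to the original value when there is no match
  mutate(age_recoded = coalesce(to, age_category)) |>
  select(age_category, age_recoded)

recoded
```

The coalesce() step plays the same role as the TRUE ~ age_category fallback in case_when().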

Transition: Old → New Recoding Functions

Superseded — recode() (avoid in new code):

recode(x, "0 to 4" = "0-14", "5 to 9" = "0-14", .default = x)

Soft-deprecated — case_match() (migrate to recode_values()):

case_match(x, c("0 to 4", "5 to 9", "10 to 14") ~ "0-14", .default = x)

New — recode_values() with formula syntax:

recode_values(x, "0 to 4" ~ "0-14", "5 to 9" ~ "0-14", default = x)

New — recode_values() with lookup table (recommended for many mappings):

recode_values(x, from = age_lookup$from, to = age_lookup$to)

New — replace_values() for partial updates (preserves column type):

replace_values(x, "Total" ~ "All Genders")

The recode_values() function replaces both recode() and case_match(). Use replace_values() when you only need to change a few values while keeping the rest intact.

Step 11: Filter and Verify Final Dataset

Remove aggregate rows before analysis. In dplyr 1.2.0, the new filter_out() makes this intent explicit—you specify what to drop rather than negate conditions:

# Drop rows where gender is "Total" or age_category is "total"
# (age_category was lowercased in Step 10, so its aggregate label is "total")
census_filtered <- census_final |>
  filter_out(gender == "Total" | age_category == "total")

census_filtered

# A tibble: 280 × 4
   state      gender age_category population
   <chr>      <chr>  <chr>             <int>
 1 Upper Nile Male   0-14              82690
 2 Upper Nile Male   0-14              83744
 3 Upper Nile Male   0-14              71027
 4 Upper Nile Male   15-24             57387
 5 Upper Nile Male   15-24             42521
 6 Upper Nile Male   25-34             38795
 7 Upper Nile Male   25-34             32236
 8 Upper Nile Male   35-44             30228
 9 Upper Nile Male   35-44             22290
10 Upper Nile Male   45-54             18163
# ℹ 270 more rows

Transition: Old → New Row Filtering

Before dplyr 1.2.0 — negated logic with filter():

# Negated logic: "keep rows where gender is NOT Total AND age is NOT total"
census_final |> filter(gender != "Total", age_category != "total")

After dplyr 1.2.0 — direct intent with filter_out():

# Clear intent — "drop rows where gender IS Total OR age IS Total"
census_final |> filter_out(gender == "Total" | age_category == "total")

Why this matters:

filter() keeps rows that match → forces negated logic to drop rows (!=, !)
filter_out() drops rows that match → write positive conditions, cleaner boolean logic
filter_out() handles NA values more predictably (rows with NA are kept, not silently dropped)
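The NA point is worth seeing with plain filter(), which behaves the same way in every dplyr version. The sketch below uses a made-up three-row tibble; it does not call filter_out(), so it runs without dplyr 1.2.0.

```r
library(dplyr)
library(tibble)

# Made-up data with one missing gender
toy <- tibble(gender = c("Male", "Total", NA))

# NA != "Total" evaluates to NA, and filter() drops rows where the
# condition is NA, so the missing row silently disappears
kept_default <- toy |> filter(gender != "Total")

# To keep the NA row with filter(), you have to ask for it explicitly
kept_with_na <- toy |> filter(gender != "Total" | is.na(gender))

print(kept_default)
print(kept_with_na)
```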

Bonus — when_any() and when_all() for multi-column conditions:

# Drop rows where ANY of the listed conditions is TRUE (scales to many conditions)
census_final |> filter_out(when_any(gender == "Total", age_category == "total"))

These helpers compose naturally with both filter() and filter_out(), making multi-column conditions readable.

Let’s confirm all transformations worked correctly:

# Generate a comprehensive summary of the cleaned and transformed dataset
# This verification step ensures all transformations were applied correctly

cat(strrep("=", 50), "\n", sep = "")

==================================================

cat("🎉 DATA TRANSFORMATION COMPLETE!\n")

🎉 DATA TRANSFORMATION COMPLETE!

cat(strrep("=", 50), "\n", sep = "")

==================================================

# Display dimensions of the final cleaned dataset
cat("📊 Final dataset dimensions:\n")

📊 Final dataset dimensions:

cat("  Rows:", nrow(census_filtered), "\n")

  Rows: 280

cat("  Columns:", ncol(census_filtered), "\n\n")

  Columns: 4

# List all column names in the cleaned dataset
cat("✅ Column names:\n")

✅ Column names:

cat("  ", paste(names(census_filtered), collapse = ", "), "\n\n")

   state, gender, age_category, population

# Show all unique gender categories (only "Male" and "Female" should remain after filtering)
cat("🎯 Unique gender values:\n")

🎯 Unique gender values:

census_filtered |> distinct(gender) |> pull(gender) |> cat("  ", "\n")

Male Female

# Show all unique age categories in sorted order
cat("\n🎯 Unique age categories:\n")


🎯 Unique age categories:

census_filtered |> distinct(age_category) |> arrange(age_category) |> pull(age_category) |> cat("  ", "\n")

0-14 15-24 25-34 35-44 45-54 55-64 65+

# Display first 10 rows of key columns to visually inspect the data
cat("\n📋 Sample of final data:\n")


📋 Sample of final data:

census_filtered |> 
  select(state, gender, age_category, population) |>
  head(10)

# A tibble: 10 × 4
   state      gender age_category population
   <chr>      <chr>  <chr>             <int>
 1 Upper Nile Male   0-14              82690
 2 Upper Nile Male   0-14              83744
 3 Upper Nile Male   0-14              71027
 4 Upper Nile Male   15-24             57387
 5 Upper Nile Male   15-24             42521
 6 Upper Nile Male   25-34             38795
 7 Upper Nile Male   25-34             32236
 8 Upper Nile Male   35-44             30228
 9 Upper Nile Male   35-44             22290
10 Upper Nile Male   45-54             18163

Data Transformation Summary

What we accomplished:

Original gender column values:

“Population, Male (Number)”
“Population, Female (Number)”
“Population, Total (Number)”

Transformed to clean format:

“Male”
“Female”
“Total”

Original age_category column structure:

14 individual age bands (thirteen 5-year bands plus “65+”), alongside a “Total” row
Examples: “0 To 4”, “5 To 9”, “10 To 14”, …, “65+”

Transformed to standardized life-stage categories:

7 broader, more interpretable age groups
“0-14”, “15-24”, “25-34”, “35-44”, “45-54”, “55-64”, “65+”

Result: A clean, standardized dataset ready for analysis and visualization! ✨

Part 4: Data Exploration and Summary

Step 12: Create Overview Statistics

Let’s calculate some key statistics about our dataset:

# Create a summary table
overview_table <- census_filtered |>
  summarise(
    `Total Population` = comma(sum(population)),     # Format with commas
    `Number of States` = n_distinct(state),          # Count unique states
    `Age Categories` = n_distinct(age_category),     # Count unique ages
    `Gender Groups` = n_distinct(gender),            # Count unique genders
    `Total Observations` = comma(n())                # Count all rows
  )

# Display the summary
overview_table

# A tibble: 1 × 5
  `Total Population` `Number of States` `Age Categories` `Gender Groups`
  <chr>                           <int>            <int>           <int>
1 8,260,490                          10                7               2
# ℹ 1 more variable: `Total Observations` <chr>

Understanding summarise()

summarise() collapses data into summary statistics:

sum() - adds up values
n_distinct() - counts unique values
n() - counts total rows
comma() - formats numbers with commas (from scales package)

It reduces many rows into one row of summaries!
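Here is the same idea on a table small enough to check by hand. The four rows and their population figures are made up for illustration.

```r
library(dplyr)
library(tibble)
library(scales)

# Made-up numbers: two states, two genders
toy <- tibble(
  state      = c("Jonglei", "Jonglei", "Unity", "Unity"),
  gender     = c("Male", "Female", "Male", "Female"),
  population = c(1200L, 1100L, 900L, 950L)
)

# Four rows collapse into a single row of summaries
toy_overview <- toy |>
  summarise(
    total_population = sum(population),    # 1200 + 1100 + 900 + 950
    n_states         = n_distinct(state),  # unique states
    n_rows           = n()                 # rows in the input
  )

toy_overview

# comma() turns a number into a formatted character string
comma(toy_overview$total_population)
```

Note that comma() returns text, which is why Total Population shows up as <chr> in the overview table above. Apply it last, after any arithmetic.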

Step 13: Display as Professional Table

Now let’s make this summary look professional using the gt package:

overview_table |>
  gt() |>
  tab_header(
    title = md("**South Sudan 2008 Census Overview**"),
    subtitle = "Key Summary Statistics"
  ) |>
  tab_style(
    style = cell_fill(color = "#22d3ee"),
    locations = cells_body()
  ) |>
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_body()
  ) |>
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16),
    heading.subtitle.font.size = px(12)
  )

South Sudan 2008 Census Overview
Key Summary Statistics
Total Population	Number of States	Age Categories	Gender Groups	Total Observations
8,260,490	10	7	2	280

The Grammar of Tables (gt)

The gt package uses a layered approach (like ggplot2 for tables):

Start with data → gt()
Add headers → tab_header()
Style cells → tab_style()
Format numbers → fmt_number()
Adjust options → tab_options()

Each layer adds or modifies the table appearance!
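A compact sketch of those layers stacked in order might look like this. The two-row data frame and its numbers are made up for illustration, and the styling choices are arbitrary.

```r
library(gt)

# Made-up numbers for two states
toy <- data.frame(
  state      = c("Jonglei", "Unity"),
  population = c(2300, 1850)
)

toy_table <- toy |>
  gt() |>                                        # 1. start with data
  tab_header(                                    # 2. add headers
    title = md("**Toy Population Table**"),
    subtitle = "Made-up numbers for illustration"
  ) |>
  tab_style(                                     # 3. style cells
    style = cell_text(weight = "bold"),
    locations = cells_body(columns = state)
  ) |>
  fmt_number(                                    # 4. format numbers
    columns = population,
    decimals = 0,
    use_seps = TRUE
  ) |>
  tab_options(table.font.size = px(13))          # 5. adjust options

toy_table
```

Because fmt_number() handles the thousands separators at display time, the population column can stay numeric in the underlying data.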

Part 5: Gender Analysis

Step 14: Calculate National Gender Distribution

Let’s analyze how the population is distributed by gender:

gender_summary <- census_filtered |>
  
  # Step 1: Group data by gender
  group_by(gender) |>
  
  # Step 2: Calculate total population for each gender
  summarise(
    population = sum(population),
    .groups = "drop"  # Remove grouping after summarise
  ) |>
  
  # Step 3: Calculate percentages
  mutate(
    percentage = population / sum(population) * 100,
    percentage_label = percent(percentage / 100, accuracy = 0.01)
  ) |>
  
  # Step 4: Sort by population (largest first)
  arrange(desc(population))

# Display the results
gender_summary

# A tibble: 2 × 4
  gender population percentage percentage_label
  <chr>       <int>      <dbl> <chr>           
1 Male      4287300       51.9 51.90%          
2 Female    3973190       48.1 48.10%

Understanding group_by() and summarise()

These two functions work together like a team:

group_by(gender)

Splits data into groups (one for Male, one for Female)
Like separating cards into piles

summarise(population = sum(population))

Performs calculations within each group
sum() adds up all population values in each group
Like counting cards in each pile

.groups = "drop"

Removes the grouping after we’re done
Prevents unexpected behavior in future operations

Final result: One row per gender with total population!
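The “unexpected behavior” that .groups = "drop" prevents is easiest to see with two grouping columns. In the sketch below (made-up numbers again), the only difference between the two pipelines is the .groups argument, yet the shares come out differently.

```r
library(dplyr)
library(tibble)

# Made-up numbers: two states, two genders
toy <- tibble(
  state      = c("Jonglei", "Jonglei", "Unity", "Unity"),
  gender     = c("Male", "Female", "Male", "Female"),
  population = c(1200L, 1100L, 900L, 950L)
)

# Grouping dropped: sum(population) is the national total,
# so the four shares add up to 1
ungrouped <- toy |>
  group_by(state, gender) |>
  summarise(population = sum(population), .groups = "drop") |>
  mutate(share = population / sum(population))

# Still grouped by state: sum(population) is computed per state,
# so the shares add up to 1 within EACH state (2 in total)
still_grouped <- toy |>
  group_by(state, gender) |>
  summarise(population = sum(population), .groups = "drop_last") |>
  mutate(share = population / sum(population))

sum(ungrouped$share)
sum(still_grouped$share)
```

Neither result is wrong in itself; the point is that leftover grouping silently changes what sum() means in later steps.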

Step 15: Display Gender Table

gender_summary |>
  # Rename columns for display
  select(
    Gender = gender, 
    Population = population, 
    `Percentage` = percentage_label
  ) |>
  
  # Create gt table
  gt() |>
  
  # Add title and subtitle
  tab_header(
    title = md("**National Gender Distribution**"),
    subtitle = "South Sudan 2008 Census"
  ) |>
  
  # Format population with commas
  fmt_number(
    columns = Population,
    decimals = 0,
    use_seps = TRUE
  ) |>
  
  # Style Male row (first row) - cyan background
  tab_style(
    style = list(
      cell_fill(color = "#22d3ee"),
      cell_text(color = "white", weight = "bold")
    ),
    locations = cells_body(rows = 1)
  ) |>
  
  # Style Female row (second row) - gold background
  tab_style(
    style = list(
      cell_fill(color = "#FFD700"),
      cell_text(color = "#000000", weight = "bold")
    ),
    locations = cells_body(rows = 2)
  ) |>
  
  # Center all columns
  cols_align(align = "center", columns = everything()) |>
  
  # Adjust font sizes
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16)
  )

National Gender Distribution
South Sudan 2008 Census
Gender	Population	Percentage
Male	4,287,300	51.90%
Female	3,973,190	48.10%

Step 16: Visualize Gender Distribution

Numbers are great, but visualizations make patterns instantly clear. Let’s create a pie chart:

ggplot(gender_summary, aes(x = "", y = population, fill = gender)) +
  
  # Create a bar chart (we'll turn it into a pie)
  geom_col(width = 1, color = "white", linewidth = 2) +
  
  # Convert bar chart to pie chart using polar coordinates
  coord_polar(theta = "y") +
  
  # Set custom colors for Male and Female
  scale_fill_manual(
    values = c(
      "Male" = "#22d3ee",
      "Female" = "#FFD700"
    )
  ) +
  
  # Add labels showing counts and percentages
  geom_text(
    aes(label = glue("{comma(population)}\n({percentage_label})")),
    position = position_stack(vjust = 0.5),  # Center in each slice
    size = 5,
    fontface = "bold",
    color = "white"
  ) +
  
  # Add titles and labels
  labs(
    title = "**Gender Distribution in South Sudan**",
    subtitle = "2008 Census Data",
    fill = "Gender"
  ) +
  
  # Use void theme for pie charts (removes axes)
  theme_void() +
  
  # Customize title and legend
  theme(
    plot.title = element_markdown(
      size = 16, 
      face = "bold", 
      color = "#06b6d4",
      hjust = 0.5,        # Center title
      margin = margin(b = 5)
    ),
    plot.subtitle = element_markdown(
      size = 12, 
      color = "#666666", 
      hjust = 0.5         # Center subtitle
    ),
    legend.position = "bottom",
    legend.title = element_text(face = "bold", size = 11)
  )

Gender distribution shown as a pie chart with population counts and percentages

Anatomy of a ggplot2 Chart

Every ggplot2 visualization follows this pattern:

Start with data → ggplot(data, aes(...))
Add geometry → geom_*() (point, line, bar, etc.)
Adjust scales → scale_*() (colors, axes, etc.)
Add labels → labs() (title, axes, etc.)
Apply theme → theme_*() (appearance)

Think of it like building with LEGO blocks—each layer adds something!

Bonus: coord_polar() transforms rectangular plots into circular ones (bar chart → pie chart)!
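The five layers listed above can be sketched on a two-row toy data frame. The values and the names `toy`, `bars`, and `pie` are made up for illustration; only the layering pattern mirrors the chart in Step 16.

```r
library(ggplot2)

# Toy shares: hypothetical values for illustration only
toy <- data.frame(group = c("A", "B"), value = c(60, 40))

bars <- ggplot(toy, aes(x = "", y = value, fill = group)) +      # 1. data + aesthetics
  geom_col(width = 1, color = "white") +                         # 2. geometry
  scale_fill_manual(values = c(A = "#22d3ee", B = "#FFD700")) +  # 3. scales
  labs(title = "Toy shares", fill = "Group") +                   # 4. labels
  theme_void()                                                   # 5. theme

# Adding polar coordinates turns the single stacked bar into a pie
pie <- bars + coord_polar(theta = "y")

pie
```

Because each layer is added with `+`, you can build a plot step by step and print it after every addition to see what each layer contributes.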

Step 17: Gender Distribution by State

Now let’s see how gender distribution varies across different states:

state_gender <- census_filtered |>
  
  # Group by both state AND gender
  group_by(state, gender) |>
  
  # Sum population within each state-gender combination
  summarise(population = sum(population), .groups = "drop") |>
  
  # Reshape from long to wide format
  # Before: Multiple rows per state (one for Male, one for Female)
  # After: One row per state (Male and Female as separate columns)
  pivot_wider(names_from = gender, values_from = population) |>
  
  # Calculate additional metrics
  mutate(
    total = Male + Female,                    # Total population
    male_pct = Male / total * 100,           # Male percentage
    female_pct = Female / total * 100,       # Female percentage
    gender_ratio = Male / Female * 100       # Males per 100 females
  ) |>
  
  # Sort by total population (largest first)
  arrange(desc(total))

# Display top 5 states
state_gender |> 
  head(5)

# A tibble: 5 × 7
  state             Female   Male   total male_pct female_pct gender_ratio
  <chr>              <int>  <int>   <int>    <dbl>      <dbl>        <dbl>
1 Jonglei           624275 734327 1358602     54.1       45.9        118. 
2 Central Equatoria 521835 581722 1103557     52.7       47.3        111. 
3 Warrap            502194 470734  972928     48.4       51.6         93.7
4 Upper Nile        438923 525430  964353     54.5       45.5        120. 
5 Eastern Equatoria 440974 465187  906161     51.3       48.7        105.

Understanding pivot_wider()

pivot_wider() reshapes data from long to wide format:

Before (Long format):

State      Gender  Population
Jonglei    Male    734327
Jonglei    Female  624275
Warrap     Male    470734
Warrap     Female  502194

After (Wide format; the Total column is added afterwards by mutate(), not by pivot_wider() itself):

State    Male    Female  Total
Jonglei  734327  624275  1358602
Warrap   470734  502194  972928

Why? Because it’s easier to calculate ratios and percentages when Male and Female are in separate columns!
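To make the reshaping concrete, here is a small self-contained sketch that rebuilds the four long-format rows shown above and widens them. The object names `long` and `wide` are illustrative.

```r
library(dplyr)
library(tidyr)

# The four long-format rows from the "Before" example
long <- tibble(
  state      = c("Jonglei", "Jonglei", "Warrap", "Warrap"),
  gender     = c("Male", "Female", "Male", "Female"),
  population = c(734327L, 624275L, 470734L, 502194L)
)

wide <- long |>
  # One column per gender, one row per state
  pivot_wider(names_from = gender, values_from = population) |>
  # Row-wise arithmetic is now a simple column expression
  mutate(
    total        = Male + Female,
    gender_ratio = Male / Female * 100
  )

wide
```

In long format the same ratio would need a grouped calculation that picks out the Male and Female rows within each state, which is why the pipeline widens first.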

Step 18: Display State Gender Table

state_gender |>
  head(5) |>
  
  # Select and rename columns for display
  select(
    State = state,
    Male,
    Female,
    Total = total,
    `Male %` = male_pct,
    `Female %` = female_pct,
    `Gender Ratio` = gender_ratio
  ) |>
  
  # Create table
  gt(rowname_col = "State") |>
  cols_align(columns = State, align = "right") |> 
  # Add header
  tab_header(
    title = md("**Gender Distribution by State**"),
    subtitle = "Top 5 Most Populous States"
  ) |>
  
  # Format population columns with commas
  fmt_number(
    columns = c(Male, Female, Total),
    decimals = 0,
    use_seps = TRUE
  ) |>
  
  # Format percentage and ratio columns
  fmt_number(
    columns = c(`Male %`, `Female %`, `Gender Ratio`),
    decimals = 2
  ) |>
  
  # Add color gradient to Gender Ratio
  # The three palette colors map to the low end, midpoint, and high end of the domain
  # With domain = c(90, 120), white marks the midpoint (105); gold = fewer males, cyan = more males
  data_color(
    columns = `Gender Ratio`,
    palette = c("#FFD700", "#ffffff", "#22d3ee"),
    domain = c(90, 120)
  ) |>
  
  # Highlight State column
  tab_style(
    style = cell_fill(color = "#f8f9fa"),
    locations = cells_body(columns = State)
  ) |>
  
  # Add footnote explaining Gender Ratio
  tab_footnote(
    footnote = "Gender Ratio represents males per 100 females. Values near 100 indicate balance.",
    locations = cells_column_labels(columns = `Gender Ratio`)
  ) |>
  
  # Apply pre-built theme
  gt_theme_538(quiet = TRUE) |>
  
  # Adjust font sizes
  tab_options(
    table.font.size = px(12),
    heading.title.font.size = px(16),
    footnotes.font.size = px(10)
  )

Gender Distribution by State
Top 5 Most Populous States
	Male	Female	Total	Male %	Female %	Gender Ratio¹
Jonglei	734,327	624,275	1,358,602	54.05	45.95	117.63
Central Equatoria	581,722	521,835	1,103,557	52.71	47.29	111.48
Warrap	470,734	502,194	972,928	48.38	51.62	93.74
Upper Nile	525,430	438,923	964,353	54.49	45.51	119.71
Eastern Equatoria	465,187	440,974	906,161	51.34	48.66	105.49
¹ Gender Ratio represents males per 100 females. Values near 100 indicate balance.
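A three-color palette places its middle color at the midpoint of the domain, so with `domain = c(90, 120)` the white point sits at 105 rather than at 100. The sketch below illustrates this with `scales::col_numeric()`. It assumes `data_color()` interpolates numeric columns in a comparable way, which is worth checking against the gt documentation. The name `ratio_colour` is illustrative.

```r
library(scales)

# Same palette and domain as in the table code above
ratio_colour <- col_numeric(
  palette = c("#FFD700", "#ffffff", "#22d3ee"),
  domain  = c(90, 120)
)

midpoint <- mean(c(90, 120))  # 105

# Colours at the low end, the midpoint, and the high end of the domain
anchor_colours <- ratio_colour(c(90, midpoint, 120))
anchor_colours

# A perfectly balanced ratio of 100 falls between the gold and white anchors
ratio_colour(100)
```

If you want white to mark a ratio of exactly 100, one option is a domain that is symmetric around 100, such as `c(80, 120)`.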

Understanding Gender Ratio

Gender Ratio = (Males / Females) × 100

Ratio = 100: Perfect balance (equal males and females)
Ratio > 100: More males than females
Ratio < 100: More females than males

For example:
- Ratio of 105 means 105 males per 100 females (5% more males)
- Ratio of 95 means 95 males per 100 females (5% fewer males)
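The formula can be checked by hand against two rows of the state table. The helper `gender_ratio()` below is a hypothetical convenience function written for this sketch; it is not part of any package.

```r
# Hypothetical helper: males per 100 females
gender_ratio <- function(males, females) {
  males / females * 100
}

jonglei <- gender_ratio(734327, 624275)  # more males than females
warrap  <- gender_ratio(470734, 502194)  # more females than males

# Rounded to two decimals, these match the table values 117.63 and 93.74
round(c(Jonglei = jonglei, Warrap = warrap), 2)
```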

Step 19: Visualize State Gender Distribution

state_gender |>
  head(5) |>
  
  # Convert from wide to long format for plotting
  # Need separate rows for Male and Female to create grouped bars
  pivot_longer(
    cols = c(Male, Female),
    names_to = "gender",
    values_to = "population"
  ) |>
  
  # Reorder states by total population for better visualization
  mutate(state = fct_reorder(state, total)) |>
  
  # Create plot
  ggplot(aes(x = state, y = population, fill = gender)) +
  
  # Grouped bar chart (bars side by side)
  geom_col(position = "dodge", alpha = 0.9, width = 0.7) +
  
  # Set colors
  scale_fill_manual(
    values = c(
      "Male" = "#22d3ee",
      "Female" = "#FFD700"
    )
  ) +
  
  # Format y-axis labels (show as "100K" instead of "100000")
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
  
  # Flip coordinates (horizontal bars are easier to read)
  coord_flip() +
  
  # Add labels
  labs(
    title = "**Population by State and Gender**",
    subtitle = "Top 5 Most Populous States | South Sudan 2008 Census",
    x = NULL,  # Remove x-axis label (it says "state" which is obvious)
    y = "Population",
    fill = "Gender"
  ) +
  
  # Customize theme
  theme(
    panel.grid.major.y = element_blank(),  # Remove horizontal grid lines
    panel.grid.major.x = element_line(color = "#e5e5e5"),
    legend.position = "top"
  )

Population by state and gender for the top 5 most populous states

Choosing the Right Chart Type

Grouped Bar Chart (what we used):
- Best for: Comparing categories across groups
- Shows: Exact values for each category
- Advantage: Easy to compare Male vs Female within each state

Stacked Bar Chart (alternative):
- Best for: Showing part-to-whole relationships
- Shows: Total and composition
- Advantage: Shows total population at a glance

Why coord_flip()? Long state names are easier to read horizontally than at an angle!
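Switching between the two chart types is a one-argument change. The sketch below uses two states from the table above, reshapes them to long format with `pivot_longer()`, and builds both variants. The object names (`wide`, `long`, `base`, `grouped`, `stacked`) are illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Two states from the state table, in wide format
wide <- tibble(
  state  = c("Jonglei", "Warrap"),
  Male   = c(734327L, 470734L),
  Female = c(624275L, 502194L)
)

# ggplot2 needs one row per bar, so reshape to long format first
long <- wide |>
  pivot_longer(
    cols      = c(Male, Female),
    names_to  = "gender",
    values_to = "population"
  )

base <- ggplot(long, aes(x = state, y = population, fill = gender)) +
  scale_fill_manual(values = c("Male" = "#22d3ee", "Female" = "#FFD700")) +
  coord_flip()

grouped <- base + geom_col(position = "dodge")  # compare Male vs Female per state
stacked <- base + geom_col(position = "stack")  # emphasise each state's total

grouped
stacked
```

Printing both side by side is a quick way to decide which comparison matters more for your audience.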

Part 6: Age Category Analysis

Step 20: Calculate National Age Distribution

age_summary <- census_filtered |>
  
  # Group by age category
  group_by(age_category) |>
  
  # Sum population for each age group
  summarise(
    population = sum(population),
    .groups = "drop"
  ) |>
  
  # Calculate percentages
  mutate(
    percentage = population / sum(population) * 100,
    percentage_label = percent(percentage / 100, accuracy = 0.01)
  ) |>
  
  # Sort by population (largest first)
  arrange(desc(population))

# Display results
age_summary

# A tibble: 7 × 4
  age_category population percentage percentage_label
  <chr>             <int>      <dbl> <chr>           
1 0-14            3659337      44.3  44.30%          
2 15-24           1628835      19.7  19.72%          
3 25-34           1234926      14.9  14.95%          
4 35-44            815517       9.87 9.87%           
5 45-54            473365       5.73 5.73%           
6 55-64            237426       2.87 2.87%           
7 65+              211084       2.56 2.56%

Step 21: Display Age Distribution Table

age_summary |>
  select(
    `Age Category` = age_category, 
    Population = population,
    Percentage = percentage_label
  ) |>
  
  # Create table
  gt() |>
  
  # Add header
  tab_header(
    title = md("**Population Distribution by Age Category**"),
    subtitle = "National Summary"
  ) |>
  
  # Format population with commas
  fmt_number(
    columns = Population,
    decimals = 0,
    use_seps = TRUE
  ) |>
  
  # Add color gradient based on population size
  data_color(
    columns = Population,
    palette = c("#000000", "#0891b2", "#22d3ee", "#FFD700")
  ) |>
  
  # Make text white on colored backgrounds
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_body(columns = Population)
  ) |>
  
  # Add vertical divider between columns
  gt_add_divider(columns = `Age Category`, color = "#e5e5e5") |>
  
  # Adjust font sizes
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16)
  )

Population Distribution by Age Category
National Summary
Age Category	Population	Percentage
0-14	3,659,337	44.30%
15-24	1,628,835	19.72%
25-34	1,234,926	14.95%
35-44	815,517	9.87%
45-54	473,365	5.73%
55-64	237,426	2.87%
65+	211,084	2.56%

Step 22: Visualize Age Distribution

age_summary |>
  
  # Reorder age categories by population for better visual ranking
  mutate(age_category = fct_reorder(age_category, population)) |>
  
  # Create plot
  ggplot(aes(x = age_category, y = population, fill = population)) +
  
  # Bar chart
  geom_col(alpha = 0.9, show.legend = FALSE) +
  
  # Add text labels showing exact population
  geom_text(
    aes(label = comma(population)),
    hjust = -0.1,  # Position slightly outside the bar
    size = 3.5,
    fontface = "bold",
    color = "#06b6d4"
  ) +
  
  # Color gradient from dark to light
  scale_fill_gradient(low = "#000000", high = "#FFD700") +
  
  # Format y-axis and add space for text labels
  scale_y_continuous(
    labels = label_number(scale = 1e-3, suffix = "K"),
    expand = expansion(mult = c(0, 0.15))  # Add 15% space on right for labels
  ) +
  
  # Horizontal bars
  coord_flip() +
  
  # Labels
  labs(
    title = "**Population Distribution by Age Category**",
    subtitle = "South Sudan 2008 Census | National Summary",
    x = NULL,
    y = "Population"
  ) +
  
  # Theme adjustments
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "#e5e5e5")
  )

Population distribution across age categories with exact counts labeled

Understanding scale_y_continuous()

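expansion(mult = c(0, 0.15)) controls the padding at each end of the y scale, expressed as a fraction of the data range:

First value (0): No padding at the lower end, so the bars start flush against the axis
Second value (0.15): 15% extra space at the upper end

Because coord_flip() draws the y scale horizontally, the lower end appears on the left and the upper end on the right.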

Why? To make room for our text labels showing exact population counts!

Without this, the labels would get cut off at the edge of the plot.
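As a rough worked example, the snippet below shows what `expansion()` returns and how far the upper limit moves for the largest bar in this chart. It assumes the scale runs from 0 to the largest population value, which is the case for bars drawn with `geom_col()`. The variable names are illustrative.

```r
library(ggplot2)

# expansion() returns a plain numeric vector:
# lower multiplier, lower additive, upper multiplier, upper additive
pad <- expansion(mult = c(0, 0.15))
pad

largest_bar <- 3659337  # population of the 0-14 age group
data_range  <- largest_bar - 0

# The upper limit is pushed out by 15% of the data range
upper_limit <- largest_bar + 0.15 * data_range
upper_limit
```

If the labels are still clipped, for example with a larger font size, increasing the second multiplier is the simplest adjustment.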

dplyr 1.2.0 & stringr 1.6.0: Migration Quick Reference

This tutorial showcased several new functions introduced in dplyr 1.2.0 and stringr 1.6.0. Here is a consolidated before/after reference for migrating your existing code:

Task	Before	After
Drop rows	`filter(x != "Total")`	`filter_out(x == "Total")`
Multi-column drop	`filter(a != "X", b != "Y")`	`filter_out(when_any(a == "X", b == "Y"))`
Recode all values	`case_match(x, "a" ~ 1, "b" ~ 2)`	`recode_values(x, "a" ~ 1, "b" ~ 2)`
Recode from lookup	`age_map[x]` (named vector)	`recode_values(x, from = tbl$from, to = tbl$to)`
Replace few values	`if_else(x == "old", "new", x)`	`replace_values(x, "old" ~ "new")`
Conditional replace	`if_else(x > 5, 0, x)`	`replace_when(x, x > 5 ~ 0)`
To snake_case	`snakecase::to_snake_case(x)`	`str_to_snake(x)`
To camelCase	manual regex	`str_to_camel(x)`
To kebab-case	`gsub(" ", "-", tolower(x))`	`str_to_kebab(x)`
Case-insensitive LIKE	`str_like(x, "%pat%", ignore_case = TRUE)`	`str_ilike(x, "%pat%")`
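One subtlety when migrating the "Drop rows" pattern is how missing values are treated. The sketch below uses a three-row toy tibble (hypothetical values) to show that `filter()` silently drops rows where the condition is `NA`. The `filter_out()` call is guarded by a version check and rests on the assumption that it keeps such rows; please verify that against the dplyr documentation for your installed version.

```r
library(dplyr)

# Toy data: hypothetical values, with one missing state name
toy <- tibble(
  state      = c("Jonglei", "Total", NA),
  population = c(10L, 30L, 5L)
)

# filter() keeps only rows where the condition is TRUE,
# so the NA row disappears together with "Total"
kept_old <- toy |> filter(state != "Total")

# NA-preserving equivalent that works in any dplyr version
kept_safe <- toy |> filter(state != "Total" | is.na(state))

# dplyr >= 1.2.0 only: drop rows where the condition is TRUE
# (assumption: rows where the condition is NA are kept)
if (packageVersion("dplyr") >= "1.2.0") {
  kept_new <- toy |> filter_out(state == "Total")
  print(kept_new)
}

nrow(kept_old)
nrow(kept_safe)
```

When your data has no missing values in the filtered column, the old and new forms give the same rows, and the migration is purely about readability.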

Deprecation Timeline

recode() — superseded since dplyr 1.1.0; migrate to recode_values() or replace_values()
case_match() — soft-deprecated in dplyr 1.2.0; migrate to recode_values()
str_like(ignore_case = TRUE) — deprecated in stringr 1.6.0; use str_ilike() instead

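The deprecated and soft-deprecated functions continue to work but can emit lifecycle warnings; superseded functions such as recode() keep working without warnings but receive no new features. New code should use the replacements above.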

Key Insights

What the Data Tells Us

1. Population Concentration

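The five most populous states (Jonglei, Central Equatoria, Warrap, Upper Nile, and Eastern Equatoria) hold about 5.3 million of the 8.26 million people counted, roughly 64% of the national population. This geographic concentration should inform resource allocation decisions.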

2. Youth Demographics

The age distribution reveals a young population, typical of developing nations: 44.3% of people are under 15, and about 64% are under 25. This “youth bulge” represents both:
- Opportunity: Large workforce potential
- Challenge: Need for education and employment infrastructure

3. Gender Balance

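Nationally, males make up 51.9% of the population. Most states show relatively balanced gender distributions, although among the five most populous states the gender ratio ranges from about 94 (Warrap) to about 120 (Upper Nile) males per 100 females. This variation may reflect:
- Migration patterns
- Conflict impacts
- Data collection methodology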

4. Regional Disparities

Substantial population differences between states suggest the need for:
- Differentiated development strategies
- Targeted resource allocation
- Context-specific policy interventions

Conclusion

Congratulations! You’ve completed a comprehensive demographic analysis using R and the tidyverse.

What You’ve Learned

Data Skills:
✅ Loading data from URLs with read_csv()
✅ Cleaning data with tidyverse functions
✅ String splitting and parsing with str_split_i(), str_remove(), and regex
✅ Extracting information from structured text (gender from “Population - Male (Number)”)
✅ Recategorizing data with case_when() and recode_values() for age groups
✅ Dropping rows with filter_out() (dplyr 1.2.0)
✅ Grouping and summarizing with group_by() and summarise()
✅ Reshaping data with pivot_wider() and pivot_longer()
✅ Calculating percentages and ratios

Visualization Skills:
✅ Creating pie charts and bar charts
✅ Customizing colors and themes
✅ Adding informative labels and titles
✅ Using coord_flip() for horizontal layouts
✅ Understanding Grammar of Graphics principles

Table Skills:
✅ Building professional tables with gt
✅ Formatting numbers and percentages
✅ Adding colors and styling
✅ Creating informative footnotes

String Processing Skills:
✅ Multiple methods for text extraction (split, regex, remove, separate)
✅ Using separate_wider_delim() to split into multiple columns
✅ Using str_split_i() to extract specific pieces
✅ Case conversion with str_to_camel(), str_to_snake(), str_to_kebab() (stringr 1.6.0)
✅ Understanding when to use each method
✅ Regular expressions for pattern matching

Workflow Skills:
✅ Using the pipe operator |> for readable code
✅ Writing clear, commented code
✅ Creating reproducible analyses
✅ Structuring code in logical steps

Next Steps for Learning

Beginner:
1. Practice with different datasets
2. Try modifying the colors and themes
3. Experiment with different chart types

Intermediate:
4. Learn about purrr for functional programming
5. Explore stringr for text manipulation
6. Study lubridate for date handling

Advanced:
7. Create interactive dashboards with Shiny
8. Build custom functions and packages
9. Contribute to open-source R projects

Resources:
- R for Data Science - Free online book
- RStudio Cheatsheets - Quick references
- TidyTuesday - Weekly practice datasets

Alier Reng

Founder, Lead Educator & Creative Director at PyStatR+

Alier Reng is a Data Scientist, Educator, and Founder of PyStatR+, a platform advancing open and practical data science education. His work blends analytics, philosophy, and storytelling to make complex ideas human and empowering. Knowledge is freedom. Data is truth's language — ethics and transparency, its grammar.

Editor’s Note

This tutorial reflects a deliberate editorial balance between accessibility and technical depth. While R offers many approaches to data manipulation, this guide emphasizes the tidyverse philosophy—particularly dplyr 1.2.0 for data transformation and stringr 1.6.0 for text processing—because these tools prioritize readability and consistency.

This edition highlights dplyr 1.2.0’s filter_out() for clearer row-dropping semantics and recode_values() with lookup tables for maintainable value mappings. We also introduce stringr 1.6.0’s case conversion trio—str_to_camel(), str_to_snake(), and str_to_kebab()—for seamless interoperability with Python, JavaScript, and API naming conventions.

This approach aligns with the PyStatR+ Charter by emphasizing clarity, honesty, and accessibility without unnecessary complexity.

Acknowledgements

This lesson is part of the broader PyStatR+ Learning Platform, developed with gratitude to mentors, learners, and the open-source community that continually advances the R ecosystem. Special thanks to Hadley Wickham, the tidyverse team, and the contributors who make tools like dplyr, stringr, and ggplot2 possible.

References

R for Data Science (2nd Edition) — Wickham, Çetinkaya-Rundel, & Grolemund
dplyr 1.2.0 Release Notes — Tidyverse Blog
Recoding and Replacing Values — dplyr vignette
stringr 1.6.0 Release Notes — Tidyverse Blog
dplyr Documentation
stringr Documentation
ggplot2 Documentation
gt Package Documentation
South Sudan National Bureau of Statistics

PyStatR+ — Learning Simplified. Communication Amplified. 🚀

--- title: "Demographic Patterns in South Sudan: A Tidyverse Exploration" subtitle: "A Beginner's Guide to Data Analysis with R and the Tidyverse" author: "Alierwai Reng" date: "2024-11-20" categories: [R, Tidyverse, ggplot2, Demographics, Data Visualization] image: featured.png description: "Learn data analysis step-by-step using R's tidyverse!" format: html: code-fold: false code-tools: true toc: true toc-depth: 3 execute: warning: false message: false --- # Demographic Patterns in South Sudan: A Tidyverse Exploration ## Demographic Patterns in South Sudan: A Tidyverse Exploration > Learn data analysis step-by-step using R's tidyverse! ## Introduction {#sec-intro} # Demographic Patterns in South Sudan: A Tidyverse Exploration ## A Beginner's Guide to Data Analysis with R and the Tidyverse Learn data analysis step-by-step using R's tidyverse! This beginner-friendly tutorial explores South Sudan's demographics with clear explanations, beautiful visualizations, and professional tables. Welcome to this hands-on data analysis tutorial! This guide showcases **dplyr 1.2.0** for powerful data manipulation—including the new `filter_out()` and `recode_values()` functions—and introduces key **stringr 1.6.0** functions for cleaning and transforming text data, including the new case conversion trio: `str_to_camel()`, `str_to_snake()`, and `str_to_kebab()`. By the end of this guide, you'll understand how to: - **Load and explore** real-world census data - **Clean and transform** data using tidyverse functions - **Calculate** summary statistics and group-level metrics - **Create beautiful visualizations** with ggplot2 - **Build professional tables** with gt We will analyze South Sudan’s 2008 census data as a practical case study; however, the analytical techniques and workflows you will learn are fully transferable to any dataset across domains and contexts. 
The data were obtained from the National Bureau of Statistics, South Sudan, via the Open Data for Africa platform: Population by Age and Sex (2008) — http://southsudan.opendataforafrica.org/fvjqdpe/population-by-age-and-sex-2008-south-sudan ::: {.callout-tip} ## What is the Tidyverse? The **tidyverse** is a collection of R packages designed to work together seamlessly for data science. Think of it as a complete toolkit where all the tools fit perfectly in your hand. Key packages include: - **dplyr** — data manipulation (filter, mutate, select, group, summarize)—the focus of this tutorial - **stringr** — string manipulation (clean, extract, transform text)—introduced throughout this tutorial - **ggplot2** - data visualization (charts and graphs) - **tidyr** - data tidying (reshape and clean) - **readr** - data import (read CSV, Excel, etc.) These packages share a common design philosophy, making your code readable and your workflow intuitive! ::: ::: {.callout-note} ## Tested With R 4.4.x, dplyr 1.2.0, stringr 1.6.0, ggplot2 3.5.x, gt 0.11.x. ::: ::: {.callout-important} ## What's New in dplyr 1.2.0 & stringr 1.6.0 **dplyr 1.2.0** (released February 2026) introduces powerful new tools: - **[`filter_out()`](https://dplyr.tidyverse.org/reference/filter_out.html)** — the missing complement to `filter()`. Drop rows instead of keeping them, with cleaner boolean logic. - **[`recode_values()`](https://dplyr.tidyverse.org/reference/recode_values.html)** — create entirely new columns by mapping old values to new values. Replaces `case_match()` with a cleaner formula or `from`/`to` interface. - **[`replace_values()`](https://dplyr.tidyverse.org/reference/replace_values.html)** — partially update an existing column while preserving its type. - **[`replace_when()`](https://dplyr.tidyverse.org/reference/replace_when.html)** — conditionally replace rows within columns, a type-stable alternative to `if_else()`. 
- **[`when_any()`](https://dplyr.tidyverse.org/reference/when_any.html)** and **`when_all()`** — elementwise OR/AND helpers for multi-column conditions. **stringr 1.6.0** (released November 2025) adds: - **`str_to_camel()`**, **`str_to_snake()`**, **`str_to_kebab()`** — convert between programming case conventions. - **`str_ilike()`** — case-insensitive SQL-like pattern matching. We'll showcase several of these throughout this tutorial! ::: --- ## Part 1: Environment Setup {#sec-setup} ### Step 1: Load Required Packages Every R analysis starts by loading the libraries (packages) we need. Think of packages as apps on your phone—each one adds specific capabilities. ```{r} #| label: load-packages #| code-summary: "Load tidyverse and visualization packages" # Core tidyverse packages (loads dplyr, ggplot2, tidyr, readr, and more!) library(tidyverse) library(janitor) # Table formatting packages library(gt) # Grammar of Tables - for beautiful tables library(gtExtras) # Extra features for gt tables # Visualization enhancement packages library(ggtext) # Rich text formatting in ggplot2 library(scales) # Scale functions for axes and labels library(glue) # Easy string interpolation # Confirmation message cat("✅ All packages loaded successfully!\n") cat("📦 Tidyverse version:", as.character(packageVersion("tidyverse")), "\n") ``` ::: {.callout-note} ## Package Installation If needed, install with: `install.packages(c("tidyverse", "gt", "gtExtras", "ggtext", "scales", "glue"))` ::: ### Step 2: Configure Visualization Theme ```{r} #| label: setup-theme #| code-summary: "Configure default ggplot2 theme" theme_set( theme_minimal(base_size = 13, base_family = "sans") + theme( plot.title = element_markdown(size = 16, face = "bold", color = "#06b6d4", margin = margin(b = 10)), plot.subtitle = element_markdown(size = 12, color = "#666666", margin = margin(b = 15)), plot.caption = element_markdown(size = 9, color = "#999999", hjust = 0), panel.grid.minor = element_blank(), 
panel.grid.major = element_line(color = "#e5e5e5", linewidth = 0.3), legend.position = "top", legend.title = element_text(face = "bold", size = 11), axis.title = element_text(face = "bold", size = 11) ) ) cat("🎨 Custom theme configured!\n") ``` --- ## Part 2: Loading and Exploring Data {#sec-load} ### Step 3: Load Census Data from URL ```{r} #| label: load-data #| code-summary: "Load census data from GitHub" url <- "https://raw.githubusercontent.com/tongakuot/r_tutorials/main/06-data-wrangling/00-input/ss_2008_census_data_raw.csv" census_raw <- read_csv(url, show_col_types = FALSE) cat("✅ Data loaded successfully!\n") cat("📋 Dimensions:", nrow(census_raw), "rows ×", ncol(census_raw), "columns\n") ``` ### Step 4: Examine the Data Structure ```{r} #| label: examine-structure #| code-summary: "View data structure with glimpse()" glimpse(census_raw) ``` ### Step 5: Preview the Data ```{r} #| label: preview-data #| code-summary: "View first 10 rows" # head() shows the first n rows census_raw |> head(10) ``` --- ## Part 3: Data Cleaning {#sec-clean} Raw data is rarely analysis-ready. We need to clean and standardize it first! ::: {.callout-note} ## Introducing stringr for Text Cleaning The **stringr** package (part of tidyverse) provides consistent, intuitive functions for string manipulation. 
In this section, we'll use several key functions: | Function | Purpose | Example | |----------|---------|---------| | `str_to_lower()` | Convert to lowercase | "HELLO" → "hello" | | `str_to_upper()` | Convert to UPPERCASE | "hello" → "HELLO" | | `str_to_title()` | Convert to Title Case | "hello world" → "Hello World" | | `str_to_sentence()` | Sentence case | "hello world" → "Hello world" | | `str_to_camel()` | Convert to camelCase | "quick brown fox" → "quickBrownFox" | | `str_to_snake()` | Convert to snake_case | "Quick Brown Fox" → "quick_brown_fox" | | `str_to_kebab()` | Convert to kebab-case | "Quick Brown Fox" → "quick-brown-fox" | | `str_squish()` | Remove extra whitespace | " hello world " → "hello world" | | `str_replace_all()` | Replace patterns | "a b c" → "a_b_c" | The first seven are case conversion functions. The trio `str_to_camel()`, `str_to_snake()`, and `str_to_kebab()` are **new in stringr 1.6.0** and convert between programming naming conventions—essential when bridging R data with Python, JavaScript, or API outputs. ::: #### stringr 1.6.0: Case Conversion in Action Before stringr 1.6.0, converting between camelCase, snake_case, and kebab-case required manual regex or external packages like `snakecase`. 
Now these conversions are built in: ```{r} #| label: stringr-case-demo #| code-summary: "Demonstrate new stringr 1.6.0 case conversion functions" #| eval: false # New in stringr 1.6.0: convert between programming case conventions demo_text <- "south sudan census data" cat("Original: ", demo_text, "\n") cat("camelCase: ", str_to_camel(demo_text), "\n") cat("PascalCase: ", str_to_camel(demo_text, first_upper = TRUE), "\n") cat("snake_case: ", str_to_snake(demo_text), "\n") cat("kebab-case: ", str_to_kebab(demo_text), "\n") cat("Title Case: ", str_to_title(demo_text), "\n") ``` ::: {.callout-important} ## Transition: Old → New Case Conversion **Before stringr 1.6.0** — manual regex or external package: ```r # Required snakecase package or manual work snakecase::to_snake_case("South Sudan Census") gsub(" ", "-", tolower("South Sudan Census")) ``` **After stringr 1.6.0** — native, consistent, pipe-friendly: ```r # Built into stringr — works seamlessly in tidyverse pipelines "South Sudan Census" |> str_to_snake() # "south_sudan_census" "South Sudan Census" |> str_to_kebab() # "south-sudan-census" "South Sudan Census" |> str_to_camel() # "southSudanCensus" ``` These are particularly valuable when bridging R data with Python (`snake_case`), JavaScript (`camelCase`), or URL slugs (`kebab-case`). 
::: ### Step 6: Clean and Transform the Dataset Here's a practical cleaning pipeline using the `janitor` package for automatic name standardization: ```{r} #| label: clean-data #| code-summary: "Complete data cleaning pipeline" #| code-line-numbers: true census_clean <- census_raw |> # Standardize column names automatically clean_names() |> # Rename to meaningful names select( state = region_name, gender = variable_name, age_category = age_name, population = x2008 ) |> # Clean text columns and convert population to integer mutate( across(where(is.character), \(x) str_squish(x) |> str_to_title()), population = as.integer(population) ) |> # Remove rows with missing or invalid data filter(!is.na(population), population > 0) # Display results cat("✅ Data cleaning complete!\n") cat("📊 Cleaned dataset:", nrow(census_clean), "rows ×", ncol(census_clean), "columns\n") ``` ::: {.callout-tip icon="true"} ## The Cleaning Pipeline Explained **Key steps:** - **`clean_names()`**: Automatically standardizes column names (lowercase, underscores, safe for R) - **`select()`**: Chooses columns and renames them simultaneously - **`across(where(is.character), ...)`**: Applies the same transformation to all text columns at once - **`str_squish()` + `str_to_title()`**: Removes extra spaces and capitalizes properly - **`as.integer()`**: Converts population to whole numbers (appropriate for count data) - **`filter()`**: Removes rows with missing or invalid population values This single approach combines efficiency with clarity! ::: ### Step 7: Preview the Cleaned Data ```{r} #| label: preview-cleaned #| code-summary: "View cleaned data" census_clean |> head(10) ``` --- ## Part 3B: Advanced String Processing {#sec-string-processing} Now we need to extract more meaningful information from our data. The `gender` column actually contains structured text like "Population - Male (Number)" that we can parse! 
### Step 8: Examine the Gender Column Structure ```{r} #| label: examine-gender #| code-summary: "Explore gender column values" # See unique values in gender column cat("🔍 Unique values in gender column:\n") census_clean |> distinct(gender) |> pull(gender) ``` The gender column follows a pattern: "Population - Gender (Type)" where we need to extract just the gender value. ### Step 9: Extract Gender Information The `gender` column contains structured text like "Population - Male (Number)". We'll extract just the gender value using the most practical approaches. #### Method 1: Using str_split_i() — Simple and Direct Split the text and extract the piece you need: ```{r} #| label: gender-method-1 #| code-summary: "Method 1: Extract with str_split_i()" # Apply gender extraction to our dataset census_parsed <- census_clean |> mutate( # Split on " ", extract 2nd piece, then remove "(Number)" text gender = str_split_i(gender, " ", 2) |> str_remove(" \$.*\$") |> str_squish() ) # Verify extraction worked cat("✅ Gender extraction complete!\n") cat("🎯 Unique gender values:\n") census_parsed |> distinct(gender) |> pull(gender) ``` ::: {.callout-tip} ## How This Works - **`str_split_i(gender, " ", 2)`**: Split on spaces, extract the 2nd piece - "Population - Male (Number)" → "- Male (Number)" - **`str_remove(" \$.*\$")`**: Remove " (anything)" pattern - "- Male (Number)" → "- Male" - **`str_squish()`**: Clean extra whitespace - "- Male" → "Male" This is the most practical approach for extraction tasks! 
::: #### Method 2: Using separate_wider_delim() — When You Need Multiple Pieces If you need to keep multiple pieces from a split, use `separate_wider_delim()`: ```{r} #| label: gender-method-2-separate #| code-summary: "Method 2: separate_wider_delim() for multiple columns" # Demonstrate separate_wider_delim() demo_separate <- census_clean |> select(original_gender = gender) |> separate_wider_delim( cols = original_gender, delim = " ", names = c("prefix", "gender_raw", "suffix"), too_few = "align_start", cols_remove = FALSE ) |> mutate(gender_clean = str_remove(gender_raw, " \$.*\$")) |> select(original_gender, prefix, gender_raw, gender_clean) |> distinct() |> head(3) cat("✅ separate_wider_delim() keeps all pieces:\n") demo_separate ``` ::: {.callout-note} ## When to Use Each Method **`str_split_i()` (Method 1):** - You only need one piece - More concise code - Recommended for this task **`separate_wider_delim()` (Method 2):** - You need multiple pieces as separate columns - Better for data reshaping workflows - Excellent when working with structured delimited text **For our case:** We use Method 1 because we only extract the gender value. 
:::

#### Method 3: Regular Expressions — For Complex Patterns

For more complex text patterns, use regex with `str_extract()`:

```{r}
#| label: gender-method-3-regex
#| code-summary: "Method 3: Regex pattern matching"

# Regex approach (for reference/learning)
demo_regex <- census_clean |>
  select(original_gender = gender) |>
  mutate(
    # Regex: text between " - " and " ("
    gender_regex = str_extract(original_gender, "(?<= - ).*(?= \\()")
  ) |>
  distinct() |>
  head(3)

cat("✅ Regex extraction with lookaround assertions:\n")
demo_regex
```

::: {.callout-note}
## Regex Lookaround Basics

`"(?<= - ).*(?= \\()"` extracts text between two patterns:

- `(?<= - )` — Lookbehind: preceded by space-dash-space
- `.*` — Match: any characters
- `(?= \\()` — Lookahead: followed by " ("

**Result:** Extracts "Male" from "Population - Male (Number)"

Additional approaches (regex lookahead, separate + unite, etc.) are covered in the [stringr documentation](https://stringr.tidyverse.org/).
:::

### Step 10: Recategorize Age Groups

Let's group the fine-grained 5-year age bands into broader, more interpretable categories:

```{r}
#| label: examine-age-categories
#| code-summary: "View current age categories"

cat("🔍 Current age categories:\n")
census_parsed |>
  distinct(age_category) |>
  arrange(age_category) |>
  pull(age_category)
```

#### Method 1: Using case_when() for Conditional Recategorization

The `case_when()` function is well suited to multi-condition transformations:

```{r}
#| label: age-method-1
#| code-summary: "Method 1: case_when() for age recategorization"
#| code-line-numbers: true

census_final <- census_parsed |>
  mutate(
    # Lowercase first so "0 To 4" and "0 to 4" are treated the same
    age_category = str_to_lower(age_category),
    age_category = case_when(
      # Children (0-14)
      age_category %in% c("0 to 4", "5 to 9", "10 to 14") ~ "0-14",
      # Youth (15-24)
      age_category %in% c("15 to 19", "20 to 24") ~ "15-24",
      # Early working age (25-34)
      age_category %in% c("25 to 29", "30 to 34") ~ "25-34",
      # Middle working age (35-44)
      age_category %in% c("35 to 39", "40 to 44") ~ "35-44",
      # Later working age (45-54)
      age_category %in% c("45 to 49", "50 to 54") ~ "45-54",
      # Pre-retirement (55-64)
      age_category %in% c("55 to 59", "60 to 64") ~ "55-64",
      # Retirement age (65+)
      age_category == "65+" ~ "65+",
      # Catch any unexpected values (including the lowercased "total" rows)
      TRUE ~ age_category
    )
  )

# Verify recategorization
cat("✅ Age categories recategorized!\n")
cat("🎯 New age categories:\n")
census_final |>
  distinct(age_category) |>
  arrange(age_category) |>
  pull(age_category)
```

::: {.callout-important icon="true"}
## Understanding case_when()

`case_when()` is like a multi-way IF statement (similar to SQL's CASE WHEN):

**Structure:**

```r
case_when(
  condition1 ~ result1,  # If condition1 is TRUE, return result1
  condition2 ~ result2,  # Else if condition2 is TRUE, return result2
  condition3 ~ result3,  # Else if condition3 is TRUE, return result3
  TRUE ~ default         # Else return default (catch-all)
)
```

**Key points:**

- Conditions are evaluated **in order** (first match wins)
- `%in%` checks if value is in a vector (like "is one of")
- `~` separates condition from result
- `TRUE ~` at the end catches anything not matched above

**Example for our data:**

- If age is "0 to 4" OR "5 to 9" OR "10 to 14" → return "0-14"
- Else if age is "15 to 19" OR "20 to 24" → return "15-24"
- And so on...
:::

#### Method 2: Using recode_values() — New in dplyr 1.2.0

For value-to-value mappings, the new `recode_values()` is cleaner than `case_when()`.
It accepts a `from`/`to` lookup table or formula syntax:

```{r}
#| label: age-method-2-recode-values
#| code-summary: "Method 2: recode_values() with lookup table (dplyr 1.2.0)"
#| eval: true

# Build a lookup table: clean and portable
age_lookup <- tribble(
  ~from,      ~to,
  "0 to 4",   "0-14",
  "5 to 9",   "0-14",
  "10 to 14", "0-14",
  "15 to 19", "15-24",
  "20 to 24", "15-24",
  "25 to 29", "25-34",
  "30 to 34", "25-34",
  "35 to 39", "35-44",
  "40 to 44", "35-44",
  "45 to 49", "45-54",
  "50 to 54", "45-54",
  "55 to 59", "55-64",
  "60 to 64", "55-64",
  "65+",      "65+"
)

# Demonstrate recode_values() (not applied to dataset)
demo_recode <- census_parsed |>
  select(age_category) |>
  mutate(
    age_category = str_to_lower(age_category),
    age_recoded = recode_values(
      age_category,
      from = age_lookup$from,
      to = age_lookup$to
    )
  ) |>
  distinct()

cat("✅ recode_values() demonstration:\n")
demo_recode
```

::: {.callout-tip}
## Comparing case_when() vs recode_values()

**Use `case_when()` when:**

- Multiple conditions per category (AND/OR logic)
- Complex conditional expressions
- **Best when conditions aren't simple equality checks**

**Use `recode_values()` (dplyr 1.2.0) when:**

- Simple value-to-value mappings
- Lookup table stored externally (CSV, tribble)
- Cleaner syntax for 1-to-1 replacements
- **Replaces `recode()` (superseded) and `case_match()` (soft-deprecated)**

Both work for our use case. We use `case_when()` above because it groups related age bands together, making the logic visible. For production code with many mappings, `recode_values()` with a lookup table is more maintainable.
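
If dplyr 1.2.0 is not yet installed, the same lookup-table idea can be approximated with a join. This is a sketch under that assumption, not the `recode_values()` implementation; `small_lookup` and `ages` are hypothetical, shortened inputs.

```r
library(dplyr)
library(tibble)

# Hypothetical lookup table, shortened to four rows for illustration
small_lookup <- tribble(
  ~from,      ~to,
  "0 to 4",   "0-14",
  "5 to 9",   "0-14",
  "15 to 19", "15-24",
  "20 to 24", "15-24"
)

ages <- tibble(age_category = c("0 to 4", "15 to 19", "20 to 24", "total"))

# Join the lookup, then fall back to the original value when nothing matches
recoded <- ages |>
  left_join(small_lookup, by = join_by(age_category == from)) |>
  mutate(age_recoded = coalesce(to, age_category)) |>
  select(-to)
recoded
```

The `coalesce()` step plays the role of a default: unmatched values such as "total" pass through unchanged.
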
:::

::: {.callout-important}
## Transition: Old → New Recoding Functions

**Superseded — `recode()` (avoid in new code):**

```r
recode(x, "0 to 4" = "0-14", "5 to 9" = "0-14", .default = x)
```

**Soft-deprecated — `case_match()` (migrate to `recode_values()`):**

```r
case_match(x, c("0 to 4", "5 to 9", "10 to 14") ~ "0-14", .default = x)
```

**New — `recode_values()` with formula syntax:**

```r
recode_values(x, "0 to 4" ~ "0-14", "5 to 9" ~ "0-14", default = x)
```

**New — `recode_values()` with lookup table (recommended for many mappings):**

```r
recode_values(x, from = age_lookup$from, to = age_lookup$to)
```

**New — `replace_values()` for partial updates (preserves column type):**

```r
replace_values(x, "Total" ~ "All Genders")
```

The `recode_values()` function replaces both `recode()` and `case_match()`. Use `replace_values()` when you only need to change a few values while keeping the rest intact.
:::

### Step 11: Filter and Verify Final Dataset

Remove aggregate rows before analysis.
In **dplyr 1.2.0**, the new `filter_out()` makes this intent explicit, because you state what to *drop* instead of negating what to keep:

```{r}
#| label: filter-out-demo
#| code-summary: "Drop aggregate rows before analysis"

# Remove rows where gender is "Total" or age_category is "total"
# (age_category was lowercased in Step 10, so its aggregate label is "total")
census_filtered <- census_final |>
  filter_out(gender == "Total" | age_category == "total")

census_filtered
```

::: {.callout-important}
## Transition: Old → New Row Filtering

**Before dplyr 1.2.0 — negated logic with `filter()`:**

```r
# Negated conditions: "keep rows where gender is NOT Total AND age is NOT total"
census_final |>
  filter(gender != "Total", age_category != "total")
```

**After dplyr 1.2.0 — direct intent with `filter_out()`:**

```r
# Clear intent: "drop rows where gender IS Total OR age IS total"
census_final |>
  filter_out(gender == "Total" | age_category == "total")
```

**Why this matters:**

- `filter()` keeps rows that match → forces negated logic to drop rows (`!=`, `!`)
- `filter_out()` drops rows that match → write positive conditions, cleaner boolean logic
- `filter_out()` handles `NA` values more predictably (rows with `NA` are kept, not silently dropped)

**Bonus — `when_any()` and `when_all()` for multi-column conditions:**

```r
# Drop rows where ANY of the listed conditions is TRUE
census_final |>
  filter_out(when_any(gender == "Total", age_category == "total"))
```

These helpers combine any number of conditions elementwise (`when_any()` acts like repeated `|`, `when_all()` like repeated `&`) and compose naturally with both `filter()` and `filter_out()`, making multi-column conditions readable.
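
The `NA` point is easy to miss, so here is a small sketch using only long-standing `filter()` behavior. The `toy` tibble is made up for illustration, and the second pipeline is an assumed pre-1.2.0 equivalent of the `filter_out()` behavior described above, not a call to `filter_out()` itself.

```r
library(dplyr)
library(tibble)

# Toy data with one missing gender value (illustrative only)
toy <- tibble(
  gender     = c("Male", "Total", NA),
  population = c(10, 20, 5)
)

# filter() keeps only rows where the condition is TRUE,
# so the NA row is dropped along with "Total"
dropped_na <- toy |> filter(gender != "Total")

# Keeping the NA row requires spelling it out explicitly
kept_na <- toy |> filter(is.na(gender) | gender != "Total")

nrow(dropped_na)
nrow(kept_na)
```

Comparing the two row counts on your own data is a quick way to check whether missing values are being lost during filtering.
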
:::

Let's confirm all transformations worked correctly:

```{r}
#| label: verify-final
#| code-summary: "Verify final cleaned dataset"

# Generate a comprehensive summary of the cleaned and transformed dataset
# This verification step ensures all transformations were applied correctly
cat(strrep("=", 50), "\n", sep = "")
cat("🎉 DATA TRANSFORMATION COMPLETE!\n")
cat(strrep("=", 50), "\n", sep = "")

# Display dimensions of the final cleaned dataset
cat("📊 Final dataset dimensions:\n")
cat(" Rows:", nrow(census_filtered), "\n")
cat(" Columns:", ncol(census_filtered), "\n\n")

# List all column names in the cleaned dataset
cat("✅ Column names:\n")
cat(" ", paste(names(census_filtered), collapse = ", "), "\n\n")

# Show all unique gender categories (only "Male" and "Female" should remain after filtering)
cat("🎯 Unique gender values:\n")
census_filtered |>
  distinct(gender) |>
  pull(gender) |>
  cat(" ", "\n")

# Show all unique age categories in sorted order
cat("\n🎯 Unique age categories:\n")
census_filtered |>
  distinct(age_category) |>
  arrange(age_category) |>
  pull(age_category) |>
  cat(" ", "\n")

# Display first 10 rows of key columns to visually inspect the data
cat("\n📋 Sample of final data:\n")
census_filtered |>
  select(state, gender, age_category, population) |>
  head(10)
```

::: {.callout-note}
## Data Transformation Summary

**What we accomplished:**

**Original `gender` column values:**

- "Population - Male (Number)"
- "Population - Female (Number)"
- "Population - Total (Number)"

**Transformed to clean format:**

- "Male"
- "Female"
- "Total" (dropped in Step 11)

**Original `age_category` column structure:**

- 16 individual 5-year age bands
- Examples: "0 To 4", "5 To 9", "10 To 14", ..., "65+"

**Transformed to standardized life-stage categories:**

- 7 broader, more interpretable age groups
- "0-14", "15-24", "25-34", "35-44", "45-54", "55-64", "65+"

**Result:** A clean, standardized dataset ready for analysis and visualization!
✨
:::

---

## Part 4: Data Exploration and Summary {#sec-explore}

### Step 12: Create Overview Statistics

Let's calculate some key statistics about our dataset:

```{r}
#| label: overview-stats
#| code-summary: "Calculate summary statistics"

# Create a summary table
overview_table <- census_filtered |>
  summarise(
    `Total Population` = comma(sum(population)),  # Format with commas
    `Number of States` = n_distinct(state),       # Count unique states
    `Age Categories` = n_distinct(age_category),  # Count unique ages
    `Gender Groups` = n_distinct(gender),         # Count unique genders
    `Total Observations` = comma(n())             # Count all rows
  )

# Display the summary
overview_table
```

::: {.callout-note}
## Understanding summarise()

`summarise()` collapses data into summary statistics:

- `sum()` - adds up values
- `n_distinct()` - counts unique values
- `n()` - counts total rows
- `comma()` - formats numbers with commas (from scales package)

It reduces many rows into one row of summaries!
:::

### Step 13: Display as Professional Table

Now let's make this summary look professional using the `gt` package:

```{r}
#| label: overview-table-styled
#| code-summary: "Create styled table with gt"

overview_table |>
  gt() |>
  tab_header(
    title = md("**South Sudan 2008 Census Overview**"),
    subtitle = "Key Summary Statistics"
  ) |>
  tab_style(
    style = cell_fill(color = "#22d3ee"),
    locations = cells_body()
  ) |>
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_body()
  ) |>
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16),
    heading.subtitle.font.size = px(12)
  )
```

::: {.callout-tip icon="true"}
## The Grammar of Tables (gt)

The `gt` package uses a **layered approach** (like ggplot2 for tables):

1. **Start with data** → `gt()`
2. **Add headers** → `tab_header()`
3. **Style cells** → `tab_style()`
4. **Format numbers** → `fmt_number()`
5. **Adjust options** → `tab_options()`

Each layer adds or modifies the table appearance!
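
The five layers can be seen together in a minimal sketch. The `toy` data frame and its values are hypothetical and typed by hand; only the sequence of gt calls is the point.

```r
library(gt)

# Hypothetical two-row summary, typed by hand for illustration
toy <- data.frame(metric = c("Groups", "Categories"), value = c(3, 7))

toy_table <- toy |>
  gt() |>                                        # 1. start with data
  tab_header(title = "Toy summary") |>           # 2. add a header
  tab_style(                                     # 3. style cells
    style = cell_text(weight = "bold"),
    locations = cells_body(columns = metric)
  ) |>
  fmt_number(columns = value, decimals = 0) |>   # 4. format numbers
  tab_options(table.font.size = px(13))          # 5. adjust options
toy_table
```

Because each call returns a gt object, layers can be added, removed, or reordered without rewriting the rest of the pipeline.
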
:::

---

## Part 5: Gender Analysis {#sec-gender}

### Step 14: Calculate National Gender Distribution

Let's analyze how the population is distributed by gender:

```{r}
#| label: gender-summary
#| code-summary: "Calculate gender distribution"
#| code-line-numbers: true

gender_summary <- census_filtered |>
  # Step 1: Group data by gender
  group_by(gender) |>
  # Step 2: Calculate total population for each gender
  summarise(
    population = sum(population),
    .groups = "drop"  # Remove grouping after summarise
  ) |>
  # Step 3: Calculate percentages
  mutate(
    percentage = population / sum(population) * 100,
    percentage_label = percent(percentage / 100, accuracy = 0.01)
  ) |>
  # Step 4: Sort by population (largest first)
  arrange(desc(population))

# Display the results
gender_summary
```

::: {.callout-important}
## Understanding group_by() and summarise()

These two functions work together like a team:

**`group_by(gender)`**

- Splits data into groups (one for Male, one for Female)
- Like separating cards into piles

**`summarise(population = sum(population))`**

- Performs calculations within each group
- `sum()` adds up all population values in each group
- Like counting cards in each pile

**`.groups = "drop"`**

- Removes the grouping after we're done
- Prevents unexpected behavior in future operations

**Final result:** One row per gender with total population!
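
The "piles" idea can be checked on a tiny hand-made example. The `toy` tibble below is illustrative, with made-up counts rather than census values.

```r
library(dplyr)
library(tibble)

# Four made-up rows: two per gender
toy <- tibble(
  gender     = c("Male", "Female", "Male", "Female"),
  population = c(100, 120, 50, 80)
)

toy_summary <- toy |>
  group_by(gender) |>                                          # split into piles
  summarise(population = sum(population), .groups = "drop") |> # count each pile
  mutate(percentage = population / sum(population) * 100)      # share of the total
toy_summary
```

Because `.groups = "drop"` removes the grouping, the `sum(population)` inside `mutate()` is the overall total rather than a per-group total, which is what makes the percentages add up to 100.
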
:::

### Step 15: Display Gender Table

```{r}
#| label: gender-table
#| code-summary: "Create styled gender distribution table"

gender_summary |>
  # Rename columns for display
  select(
    Gender = gender,
    Population = population,
    `Percentage` = percentage_label
  ) |>
  # Create gt table
  gt() |>
  # Add title and subtitle
  tab_header(
    title = md("**National Gender Distribution**"),
    subtitle = "South Sudan 2008 Census"
  ) |>
  # Format population with commas
  fmt_number(
    columns = Population,
    decimals = 0,
    use_seps = TRUE
  ) |>
  # Style Male row (first row) - cyan background
  tab_style(
    style = list(
      cell_fill(color = "#22d3ee"),
      cell_text(color = "white", weight = "bold")
    ),
    locations = cells_body(rows = 1)
  ) |>
  # Style Female row (second row) - gold background
  tab_style(
    style = list(
      cell_fill(color = "#FFD700"),
      cell_text(color = "#000000", weight = "bold")
    ),
    locations = cells_body(rows = 2)
  ) |>
  # Center all columns
  cols_align(align = "center", columns = everything()) |>
  # Adjust font sizes
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16)
  )
```

### Step 16: Visualize Gender Distribution

Numbers are great, but visualizations make patterns instantly clear.
Let's create a pie chart:

```{r}
#| label: gender-viz
#| code-summary: "Create pie chart for gender distribution"
#| fig-width: 10
#| fig-height: 6
#| fig-cap: "Gender distribution shown as a pie chart with population counts and percentages"

ggplot(gender_summary, aes(x = "", y = population, fill = gender)) +
  # Create a bar chart (we'll turn it into a pie)
  geom_col(width = 1, color = "white", linewidth = 2) +
  # Convert bar chart to pie chart using polar coordinates
  coord_polar(theta = "y") +
  # Set custom colors for Male and Female
  scale_fill_manual(
    values = c(
      "Male" = "#22d3ee",
      "Female" = "#FFD700"
    )
  ) +
  # Add labels showing counts and percentages
  geom_text(
    aes(label = glue("{comma(population)}\n({percentage_label})")),
    position = position_stack(vjust = 0.5),  # Center in each slice
    size = 5,
    fontface = "bold",
    color = "white"
  ) +
  # Add titles and labels
  labs(
    title = "**Gender Distribution in South Sudan**",
    subtitle = "2008 Census Data",
    fill = "Gender"
  ) +
  # Use void theme for pie charts (removes axes)
  theme_void() +
  # Customize title and legend
  theme(
    plot.title = element_markdown(
      size = 16,
      face = "bold",
      color = "#06b6d4",
      hjust = 0.5,  # Center title
      margin = margin(b = 5)
    ),
    plot.subtitle = element_markdown(
      size = 12,
      color = "#666666",
      hjust = 0.5  # Center subtitle
    ),
    legend.position = "bottom",
    legend.title = element_text(face = "bold", size = 11)
  )
```

::: {.callout-tip}
## Anatomy of a ggplot2 Chart

Every ggplot2 visualization follows this pattern:

1. **Start with data** → `ggplot(data, aes(...))`
2. **Add geometry** → `geom_*()` (point, line, bar, etc.)
3. **Adjust scales** → `scale_*()` (colors, axes, etc.)
4. **Add labels** → `labs()` (title, axes, etc.)
5. **Apply theme** → `theme_*()` (appearance)

Think of it like building with LEGO blocks—each layer adds something!

**Bonus:** `coord_polar()` transforms rectangular plots into circular ones (bar chart → pie chart)!
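
The same five-part pattern in its smallest form, as a sketch with made-up data (the `toy` data frame and its labels are illustrative, not census values):

```r
library(ggplot2)

# Illustrative counts, made up for this sketch
toy <- data.frame(group = c("A", "B", "C"), value = c(30, 50, 20))

p <- ggplot(toy, aes(x = group, y = value, fill = group)) +  # 1. data + aesthetics
  geom_col() +                                               # 2. geometry
  scale_fill_brewer(palette = "Set2") +                      # 3. scales
  labs(title = "Toy bar chart", x = NULL, y = "Value") +     # 4. labels
  theme_minimal()                                            # 5. theme
p
```

Removing any single line after the first still leaves a valid plot, which is a handy way to learn what each layer contributes.
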
:::

### Step 17: Gender Distribution by State

Now let's see how gender distribution varies across different states:

```{r}
#| label: state-gender-analysis
#| code-summary: "Calculate state-level gender statistics"

state_gender <- census_filtered |>
  # Group by both state AND gender
  group_by(state, gender) |>
  # Sum population within each state-gender combination
  summarise(population = sum(population), .groups = "drop") |>
  # Reshape from long to wide format
  # Before: Multiple rows per state (one for Male, one for Female)
  # After: One row per state (Male and Female as separate columns)
  pivot_wider(names_from = gender, values_from = population) |>
  # Calculate additional metrics
  mutate(
    total = Male + Female,              # Total population
    male_pct = Male / total * 100,      # Male percentage
    female_pct = Female / total * 100,  # Female percentage
    gender_ratio = Male / Female * 100  # Males per 100 females
  ) |>
  # Sort by total population (largest first)
  arrange(desc(total))

# Display top 5 states
state_gender |>
  head(5)
```

::: {.callout-note}
## Understanding pivot_wider()

`pivot_wider()` reshapes data from **long** to **wide** format:

**Before (Long format):**

```
State    Gender  Population
Jonglei  Male    734327
Jonglei  Female  624275
Warrap   Male    470734
Warrap   Female  502194
```

**After (Wide format):**

```
State    Male    Female  Total
Jonglei  734327  624275  1358602
Warrap   470734  502194  972928
```

Why? Because it's easier to calculate ratios and percentages when Male and Female are in separate columns!
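
The reshaping above can be reproduced directly. In this sketch the `long` tibble is typed by hand from the four rows shown in the illustration, so it stands in for, and is not loaded from, the census data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Values copied from the long-format illustration above
long <- tibble(
  state      = c("Jonglei", "Jonglei", "Warrap", "Warrap"),
  gender     = c("Male", "Female", "Male", "Female"),
  population = c(734327, 624275, 470734, 502194)
)

wide <- long |>
  # One row per state, with Male and Female as columns
  pivot_wider(names_from = gender, values_from = population) |>
  mutate(
    total        = Male + Female,        # row-wise total
    gender_ratio = Male / Female * 100   # males per 100 females
  )
wide
```

With both genders on the same row, the ratio is a single column expression; in long format the same calculation would need a grouped lookup of each gender's value.
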
:::

### Step 18: Display State Gender Table

```{r}
#| label: state-gender-table
#| code-summary: "Create styled state gender table"

state_gender |>
  head(5) |>
  # Select and rename columns for display
  select(
    State = state,
    Male,
    Female,
    Total = total,
    `Male %` = male_pct,
    `Female %` = female_pct,
    `Gender Ratio` = gender_ratio
  ) |>
  # Create table
  gt(rowname_col = "State") |>
  cols_align(columns = State, align = "right") |>
  # Add header
  tab_header(
    title = md("**Gender Distribution by State**"),
    subtitle = "Top 5 Most Populous States"
  ) |>
  # Format population columns with commas
  fmt_number(
    columns = c(Male, Female, Total),
    decimals = 0,
    use_seps = TRUE
  ) |>
  # Format percentage and ratio columns
  fmt_number(
    columns = c(`Male %`, `Female %`, `Gender Ratio`),
    decimals = 2
  ) |>
  # Add color gradient to Gender Ratio across the 90-120 domain
  # Gold = lower ratios (relatively more females), cyan = higher ratios (more males)
  # White marks the midpoint of the domain
  data_color(
    columns = `Gender Ratio`,
    palette = c("#FFD700", "#ffffff", "#22d3ee"),
    domain = c(90, 120)
  ) |>
  # Highlight State column
  tab_style(
    style = cell_fill(color = "#f8f9fa"),
    locations = cells_body(columns = State)
  ) |>
  # Add footnote explaining Gender Ratio
  tab_footnote(
    footnote = "Gender Ratio represents males per 100 females. Values near 100 indicate balance.",
    locations = cells_column_labels(columns = `Gender Ratio`)
  ) |>
  # Apply pre-built theme
  gt_theme_538(quiet = TRUE) |>
  # Adjust font sizes
  tab_options(
    table.font.size = px(12),
    heading.title.font.size = px(16),
    footnotes.font.size = px(10)
  )
```

::: {.callout-important}
## Understanding Gender Ratio

**Gender Ratio** = (Males / Females) × 100

- **Ratio = 100**: Perfect balance (equal males and females)
- **Ratio > 100**: More males than females
- **Ratio < 100**: More females than males

For example:

- Ratio of 105 means 105 males per 100 females (5% more males)
- Ratio of 95 means 95 males per 100 females (5% fewer males)
:::

### Step 19: Visualize State Gender Distribution

```{r}
#| label: state-gender-viz
#| code-summary: "Create grouped bar chart by state and gender"
#| fig-width: 12
#| fig-height: 8
#| fig-cap: "Population by state and gender for the top 5 most populous states"

state_gender |>
  head(5) |>
  # Convert from wide to long format for plotting
  # Need separate rows for Male and Female to create grouped bars
  pivot_longer(
    cols = c(Male, Female),
    names_to = "gender",
    values_to = "population"
  ) |>
  # Reorder states by total population for better visualization
  mutate(state = fct_reorder(state, total)) |>
  # Create plot
  ggplot(aes(x = state, y = population, fill = gender)) +
  # Grouped bar chart (bars side by side)
  geom_col(position = "dodge", alpha = 0.9, width = 0.7) +
  # Set colors
  scale_fill_manual(
    values = c(
      "Male" = "#22d3ee",
      "Female" = "#FFD700"
    )
  ) +
  # Format y-axis labels (show as "100K" instead of "100000")
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
  # Flip coordinates (horizontal bars are easier to read)
  coord_flip() +
  # Add labels
  labs(
    title = "**Population by State and Gender**",
    subtitle = "Top 5 Most Populous States | South Sudan 2008 Census",
    x = NULL,  # Remove x-axis label (it says "state" which is obvious)
    y = "Population",
    fill = "Gender"
  ) +
  # Customize theme
  theme(
    panel.grid.major.y = element_blank(),  # Remove horizontal grid lines
    panel.grid.major.x = element_line(color = "#e5e5e5"),
    legend.position = "top"
  )
```

::: {.callout-tip}
## Choosing the Right Chart Type

**Grouped Bar Chart** (what we used):

- Best for: Comparing categories across groups
- Shows: Exact values for each category
- Advantage: Easy to compare Male vs Female within each state

**Stacked Bar Chart** (alternative):

- Best for: Showing part-to-whole relationships
- Shows: Total and composition
- Advantage: Shows total population at a glance

**Why coord_flip()?** Long state names are easier to read horizontally than at an angle!
:::

---

## Part 6: Age Category Analysis {#sec-age}

### Step 20: Calculate National Age Distribution

```{r}
#| label: age-summary
#| code-summary: "Calculate age category distribution"

age_summary <- census_filtered |>
  # Group by age category
  group_by(age_category) |>
  # Sum population for each age group
  summarise(
    population = sum(population),
    .groups = "drop"
  ) |>
  # Calculate percentages
  mutate(
    percentage = population / sum(population) * 100,
    percentage_label = percent(percentage / 100, accuracy = 0.01)
  ) |>
  # Sort by population (largest first)
  arrange(desc(population))

# Display results
age_summary
```

### Step 21: Display Age Distribution Table

```{r}
#| label: age-table
#| code-summary: "Create styled age distribution table"

age_summary |>
  select(
    `Age Category` = age_category,
    Population = population,
    Percentage = percentage_label
  ) |>
  # Create table
  gt() |>
  # Add header
  tab_header(
    title = md("**Population Distribution by Age Category**"),
    subtitle = "National Summary"
  ) |>
  # Format population with commas
  fmt_number(
    columns = Population,
    decimals = 0,
    use_seps = TRUE
  ) |>
  # Add color gradient based on population size
  data_color(
    columns = Population,
    palette = c("#000000", "#0891b2", "#22d3ee", "#FFD700")
  ) |>
  # Make text white on colored backgrounds
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_body(columns = Population)
  ) |>
  # Add vertical divider between columns
  gt_add_divider(columns = `Age Category`, color = "#e5e5e5") |>
  # Adjust font sizes
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16)
  )
```

### Step 22: Visualize Age Distribution

```{r}
#| label: age-viz
#| code-summary: "Create horizontal bar chart for age distribution"
#| fig-width: 12
#| fig-height: 6
#| fig-cap: "Population distribution across age categories with exact counts labeled"

age_summary |>
  # Reorder age categories by population for better visual ranking
  mutate(age_category = fct_reorder(age_category, population)) |>
  # Create plot
  ggplot(aes(x = age_category, y = population, fill = population)) +
  # Bar chart
  geom_col(alpha = 0.9, show.legend = FALSE) +
  # Add text labels showing exact population
  geom_text(
    aes(label = comma(population)),
    hjust = -0.1,  # Position slightly outside the bar
    size = 3.5,
    fontface = "bold",
    color = "#06b6d4"
  ) +
  # Color gradient from dark to light
  scale_fill_gradient(low = "#000000", high = "#FFD700") +
  # Format y-axis and add space for text labels
  scale_y_continuous(
    labels = label_number(scale = 1e-3, suffix = "K"),
    expand = expansion(mult = c(0, 0.15))  # Add 15% space on right for labels
  ) +
  # Horizontal bars
  coord_flip() +
  # Labels
  labs(
    title = "**Population Distribution by Age Category**",
    subtitle = "South Sudan 2008 Census | National Summary",
    x = NULL,
    y = "Population"
  ) +
  # Theme adjustments
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "#e5e5e5")
  )
```

::: {.callout-note}
## Understanding scale_y_continuous()

**`expansion(mult = c(0, 0.15))`** controls the padding added at each end of the axis, as a fraction of the data range:

- First value (0): No extra space at the lower end (the left side, because `coord_flip()` makes the y-axis horizontal)
- Second value (0.15): Add 15% extra space at the upper end (the right side)

Why? To make room for our text labels showing exact population counts! Without this, the labels would get cut off at the edge of the plot.
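
The arithmetic behind the multiplicative padding can be sketched by hand. The `limits` values are hypothetical, and this reproduces the calculation only; it does not call ggplot2's internal expansion code.

```r
# Hypothetical axis limits before expansion
limits <- c(0, 1000)

# Same multipliers as expansion(mult = c(0, 0.15))
mult <- c(0, 0.15)

# Padding is a fraction of the axis range
span <- diff(limits)
expanded <- c(
  limits[1] - mult[1] * span,  # lower end: no padding
  limits[2] + mult[2] * span   # upper end: 15% of the range
)
expanded
```

If labels still touch the plot edge, increasing the second multiplier is usually the simplest adjustment.
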
:::

---

## dplyr 1.2.0 & stringr 1.6.0: Migration Quick Reference {#sec-migration}

This tutorial showcased several new functions introduced in **dplyr 1.2.0** and **stringr 1.6.0**. Here is a consolidated before/after reference for migrating your existing code:

| Task | Before | After (2026) |
|------|--------|--------------|
| **Drop rows** | `filter(x != "Total")` | `filter_out(x == "Total")` |
| **Multi-column drop** | `filter(a != "X", b != "X")` | `filter_out(when_any(a == "X", b == "X"))` |
| **Recode all values** | `case_match(x, "a" ~ 1, "b" ~ 2)` | `recode_values(x, "a" ~ 1, "b" ~ 2)` |
| **Recode from lookup** | `age_map[x]` (named vector) | `recode_values(x, from = tbl$from, to = tbl$to)` |
| **Replace few values** | `if_else(x == "old", "new", x)` | `replace_values(x, "old" ~ "new")` |
| **Conditional replace** | `if_else(x > 5, 0, x)` | `replace_when(x, x > 5 ~ 0)` |
| **To snake_case** | `snakecase::to_snake_case(x)` | `str_to_snake(x)` |
| **To camelCase** | manual regex | `str_to_camel(x)` |
| **To kebab-case** | `gsub(" ", "-", tolower(x))` | `str_to_kebab(x)` |
| **Case-insensitive LIKE** | `str_detect(x, regex("pat", TRUE))` | `str_ilike(x, "pat")` |

::: {.callout-note}
## Deprecation Timeline

- **`recode()`** — superseded since dplyr 1.1.0; migrate to `recode_values()` or `replace_values()`
- **`case_match()`** — soft-deprecated in dplyr 1.2.0; migrate to `recode_values()`
- **`str_like(ignore_case = TRUE)`** — deprecated in stringr 1.6.0; use `str_ilike()` instead

These old functions continue to work. Deprecated ones emit warnings, while superseded ones such as `recode()` stay silent but receive no new features. New code should use the replacements above.
:::

---

## Key Insights {#sec-insights}

::: {.callout-important icon="true"}
## What the Data Tells Us

### 1. **Population Concentration**

The top five states account for a significant portion of the national population. This geographic concentration should inform resource allocation decisions.

### 2. **Youth Demographics**

The age distribution reveals a **young population**, typical of developing nations. This "youth bulge" represents both:

- **Opportunity**: Large workforce potential
- **Challenge**: Need for education and employment infrastructure

### 3. **Gender Balance**

Most states show **relatively balanced** gender distributions, with some variation that may reflect:

- Migration patterns
- Conflict impacts
- Data collection methodology

### 4. **Regional Disparities**

Substantial population differences between states suggest the need for:

- Differentiated development strategies
- Targeted resource allocation
- Context-specific policy interventions
:::

---

## Conclusion {#sec-conclusion}

Congratulations! You've completed a comprehensive demographic analysis using R and the tidyverse.

::: {.callout-tip icon="true"}
## What You've Learned

**Data Skills:**

- ✅ Loading data from URLs with `read_csv()`
- ✅ Cleaning data with tidyverse functions
- ✅ **String splitting and parsing** with `str_split_i()`, `str_remove()`, and regex
- ✅ **Extracting information from structured text** (gender from "Population - Male (Number)")
- ✅ **Recategorizing data** with `case_when()` and `recode_values()` for age groups
- ✅ **Dropping rows** with `filter_out()` (dplyr 1.2.0)
- ✅ Grouping and summarizing with `group_by()` and `summarise()`
- ✅ Reshaping data with `pivot_wider()` and `pivot_longer()`
- ✅ Calculating percentages and ratios

**Visualization Skills:**

- ✅ Creating pie charts and bar charts
- ✅ Customizing colors and themes
- ✅ Adding informative labels and titles
- ✅ Using `coord_flip()` for horizontal layouts
- ✅ Understanding Grammar of Graphics principles

**Table Skills:**

- ✅ Building professional tables with gt
- ✅ Formatting numbers and percentages
- ✅ Adding colors and styling
- ✅ Creating informative footnotes

**String Processing Skills:**

- ✅ Multiple methods for text extraction (split, regex, remove, separate)
- ✅ Using `separate_wider_delim()` to split into multiple columns
- ✅ Using `str_split_i()` to extract specific pieces
- ✅ **Case conversion** with `str_to_camel()`, `str_to_snake()`, `str_to_kebab()` (stringr 1.6.0)
- ✅ Understanding when to use each method
- ✅ Regular expressions for pattern matching

**Workflow Skills:**

- ✅ Using the pipe operator `|>` for readable code
- ✅ Writing clear, commented code
- ✅ Creating reproducible analyses
- ✅ Structuring code in logical steps
:::

::: {.callout-note}
## Next Steps for Learning

**Beginner:**

1. Practice with different datasets
2. Try modifying the colors and themes
3. Experiment with different chart types

**Intermediate:**

4. Learn about `purrr` for functional programming
5. Explore `stringr` for text manipulation
6. Study `lubridate` for date handling

**Advanced:**

7. Create interactive dashboards with Shiny
8. Build custom functions and packages
9. Contribute to open-source R projects

**Resources:**

- [R for Data Science](https://r4ds.hadley.nz/) - Free online book
- [RStudio Cheatsheets](https://posit.co/resources/cheatsheets/) - Quick references
- [TidyTuesday](https://github.com/rfordatascience/tidytuesday) - Weekly practice datasets
:::

---

```{=html}
<hr class="author-section-divider">
<div class="author-card">
<img src="/images/blog/alier-reng-founder.png" alt="Alier Reng" class="author-card-photo">
<div class="author-card-info">
<h3>Alier Reng</h3>
<div class="author-card-role">Founder, Lead Educator & Creative Director at PyStatR+</div>
<p class="author-card-bio"> Alier Reng is a Data Scientist, Educator, and Founder of PyStatR+, a platform advancing open and practical data science education. His work blends analytics, philosophy, and storytelling to make complex ideas human and empowering. Knowledge is freedom. Data is truth's language — ethics and transparency, its grammar.
</p> <div class="author-card-social"> <a href="https://www.pystatrplus.org" title="PyStatR+" aria-label="PyStatR+ Website" class="social-website"> <svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-1 17.93c-3.95-.49-7-3.85-7-7.93 0-.62.08-1.21.21-1.79L9 15v1c0 1.1.9 2 2 2v1.93zm6.9-2.54c-.26-.81-1-1.39-1.9-1.39h-1v-3c0-.55-.45-1-1-1H8v-2h2c.55 0 1-.45 1-1V7h2c1.1 0 2-.9 2-2v-.41c2.93 1.19 5 4.06 5 7.41 0 2.08-.8 3.97-2.1 5.39z"/></svg> <span>Website</span> </a> <a href="https://github.com/Alierwai" title="GitHub" aria-label="GitHub" class="social-github"> <svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 2A10 10 0 0 0 2 12c0 4.42 2.87 8.17 6.84 9.5.5.08.66-.23.66-.5v-1.69c-2.77.6-3.36-1.34-3.36-1.34-.46-1.16-1.11-1.47-1.11-1.47-.91-.62.07-.6.07-.6 1 .07 1.53 1.03 1.53 1.03.87 1.52 2.34 1.07 2.91.83.09-.65.35-1.09.63-1.34-2.22-.25-4.55-1.11-4.55-4.92 0-1.11.38-2 1.03-2.71-.1-.25-.45-1.29.1-2.64 0 0 .84-.27 2.75 1.02.79-.22 1.65-.33 2.5-.33.85 0 1.71.11 2.5.33 1.91-1.29 2.75-1.02 2.75-1.02.55 1.35.2 2.39.1 2.64.65.71 1.03 1.6 1.03 2.71 0 3.82-2.34 4.66-4.57 4.91.36.31.69.92.69 1.85V21c0 .27.16.59.67.5C19.14 20.16 22 16.42 22 12A10 10 0 0 0 12 2z"/></svg> <span>GitHub</span> </a> <a href="https://www.linkedin.com/in/alierreng" title="LinkedIn" aria-label="LinkedIn" class="social-linkedin"> <svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M20.45 20.45h-3.56v-5.57c0-1.33-.02-3.04-1.85-3.04-1.85 0-2.14 1.45-2.14 2.94v5.67H9.34V9h3.41v1.56h.05c.48-.9 1.64-1.85 3.37-1.85 3.6 0 4.27 2.37 4.27 5.46v6.28zM5.34 7.43a2.06 2.06 0 1 1 0-4.12 2.06 2.06 0 0 1 0 4.12zM7.12 20.45H3.56V9h3.56v11.45z"/></svg> <span>LinkedIn</span> </a> <a href="https://youtube.com/@PyStatRPlus" title="YouTube" aria-label="YouTube" class="social-youtube"> <svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M23.498 6.186a3.016 3.016 0 0 0-2.122-2.136C19.505 
3.545 12 3.545 12 3.545s-7.505 0-9.377.505A3.017 3.017 0 0 0 .502 6.186C0 8.07 0 12 0 12s0 3.93.502 5.814a3.016 3.016 0 0 0 2.122 2.136c1.871.505 9.376.505 9.376.505s7.505 0 9.377-.505a3.015 3.015 0 0 0 2.122-2.136C24 15.93 24 12 24 12s0-3.93-.502-5.814zM9.545 15.568V8.432L15.818 12l-6.273 3.568z"/></svg>
<span>YouTube</span>
</a>
</div>
</div>
</div>
```

---

## Editor's Note

This tutorial reflects a deliberate editorial balance between accessibility and technical depth. While R offers many approaches to data manipulation, this guide emphasizes the **tidyverse philosophy** (particularly dplyr 1.2.0 for data transformation and stringr 1.6.0 for text processing) because these tools prioritize readability and consistency.

This edition highlights dplyr 1.2.0's `filter_out()` for clearer row-dropping semantics and `recode_values()` with lookup tables for maintainable value mappings. We also introduce stringr 1.6.0's case conversion trio (`str_to_camel()`, `str_to_snake()`, and `str_to_kebab()`) for seamless interoperability with Python, JavaScript, and API naming conventions.

This approach aligns with the PyStatR+ Charter by emphasizing clarity, honesty, and accessibility without unnecessary complexity.

---

## Acknowledgements

This lesson is part of the broader **PyStatR+ Learning Platform**, developed with gratitude to mentors, learners, and the open-source community that continually advances the R ecosystem. Special thanks to Hadley Wickham, the tidyverse team, and the contributors who make tools like dplyr, stringr, and ggplot2 possible.

---

## References

- [R for Data Science (2nd Edition)](https://r4ds.hadley.nz/) - Wickham, Çetinkaya-Rundel, & Grolemund
- [dplyr 1.2.0 Release Notes](https://tidyverse.org/blog/2026/02/dplyr-1-2-0/) - Tidyverse Blog
- [Recoding and Replacing Values](https://dplyr.tidyverse.org/articles/recoding-replacing.html) - dplyr vignette
- [stringr 1.6.0 Release Notes](https://tidyverse.org/blog/2025/11/stringr-1-6-0/) - Tidyverse Blog
- [dplyr Documentation](https://dplyr.tidyverse.org/)
- [stringr Documentation](https://stringr.tidyverse.org/)
- [ggplot2 Documentation](https://ggplot2.tidyverse.org/)
- [gt Package Documentation](https://gt.rstudio.com/)
- [South Sudan National Bureau of Statistics](http://southsudan.opendataforafrica.org/)

---

**PyStatR+** - *Learning Simplified. Communication Amplified.* 🚀
