---
title: "Demographic Patterns in South Sudan: A Tidyverse Exploration"
subtitle: "A Beginner's Guide to Data Analysis with R and the Tidyverse"
author: "Alierwai Reng"
date: "2024-11-20"
categories: [R, Tidyverse, ggplot2, Demographics, Data Visualization]
image: featured.png
description: "Learn data analysis step-by-step using R's tidyverse!"
format:
  html:
    code-fold: false
    code-tools: true
    toc: true
    toc-depth: 3
execute:
  warning: false
  message: false
---
## Introduction {#sec-intro}
This beginner-friendly tutorial explores South Sudan's demographics with clear explanations, visualizations, and formatted tables.
Welcome to this hands-on data analysis tutorial! This guide showcases **dplyr 1.2.0** for powerful data manipulation—including the new `filter_out()` and `recode_values()` functions—and introduces key **stringr 1.6.0** functions for cleaning and transforming text data, including the new case conversion trio: `str_to_camel()`, `str_to_snake()`, and `str_to_kebab()`.
By the end of this guide, you'll understand how to:
- **Load and explore** real-world census data
- **Clean and transform** data using tidyverse functions
- **Calculate** summary statistics and group-level metrics
- **Create beautiful visualizations** with ggplot2
- **Build professional tables** with gt
We will analyze South Sudan’s 2008 census data as a practical case study; however, the analytical techniques and workflows you will learn are fully transferable to any dataset across domains and contexts.
The data were obtained from the National Bureau of Statistics, South Sudan, via the Open Data for Africa platform:
Population by Age and Sex (2008) — http://southsudan.opendataforafrica.org/fvjqdpe/population-by-age-and-sex-2008-south-sudan
::: {.callout-tip}
## What is the Tidyverse?
The **tidyverse** is a collection of R packages designed to work together seamlessly for data science. Think of it as a complete toolkit where all the tools fit perfectly in your hand. Key packages include:
- **dplyr**: data manipulation (filter, mutate, select, group, summarize); the focus of this tutorial
- **stringr**: string manipulation (clean, extract, transform text); introduced throughout this tutorial
- **ggplot2**: data visualization (charts and graphs)
- **tidyr**: data tidying (reshape and clean)
- **readr**: data import (CSV, TSV, and other delimited text files; Excel files need the separate readxl package)
These packages share a common design philosophy, making your code readable and your workflow intuitive!
:::
::: {.callout-note}
## Tested With
R 4.4.x, dplyr 1.2.0, stringr 1.6.0, ggplot2 3.5.x, gt 0.11.x.
:::
::: {.callout-important}
## What's New in dplyr 1.2.0 & stringr 1.6.0
**dplyr 1.2.0** (released February 2026) introduces powerful new tools:
- **[`filter_out()`](https://dplyr.tidyverse.org/reference/filter_out.html)** — the missing complement to `filter()`. Drop rows instead of keeping them, with cleaner boolean logic.
- **[`recode_values()`](https://dplyr.tidyverse.org/reference/recode_values.html)** — create entirely new columns by mapping old values to new values. Replaces `case_match()` with a cleaner formula or `from`/`to` interface.
- **[`replace_values()`](https://dplyr.tidyverse.org/reference/replace_values.html)** — partially update an existing column while preserving its type.
- **[`replace_when()`](https://dplyr.tidyverse.org/reference/replace_when.html)** — conditionally replace rows within columns, a type-stable alternative to `if_else()`.
- **[`when_any()`](https://dplyr.tidyverse.org/reference/when_any.html)** and **`when_all()`** — elementwise OR/AND helpers for multi-column conditions.
**stringr 1.6.0** (released November 2025) adds:
- **`str_to_camel()`**, **`str_to_snake()`**, **`str_to_kebab()`** — convert between programming case conventions.
- **`str_ilike()`** — case-insensitive SQL-like pattern matching.
We'll showcase several of these throughout this tutorial!
:::
---
## Part 1: Environment Setup {#sec-setup}
### Step 1: Load Required Packages
Every R analysis starts by loading the libraries (packages) we need. Think of packages as apps on your phone—each one adds specific capabilities.
```{r}
#| label: load-packages
#| code-summary: "Load tidyverse and visualization packages"
# Core tidyverse packages (loads dplyr, ggplot2, tidyr, readr, and more!)
library(tidyverse)
library(janitor)
# Table formatting packages
library(gt) # Grammar of Tables - for beautiful tables
library(gtExtras) # Extra features for gt tables
# Visualization enhancement packages
library(ggtext) # Rich text formatting in ggplot2
library(scales) # Scale functions for axes and labels
library(glue) # Easy string interpolation
# Confirmation message
cat("✅ All packages loaded successfully!\n")
cat("📦 Tidyverse version:", as.character(packageVersion("tidyverse")), "\n")
```
::: {.callout-note}
## Package Installation
If needed, install with: `install.packages(c("tidyverse", "janitor", "gt", "gtExtras", "ggtext", "scales", "glue"))`
:::
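Because several functions used later depend on specific package releases, it can help to check the environment before running anything. The following is a minimal sketch, assuming the package list from Step 1 and the version numbers quoted in this tutorial; the helper name `check_min_version()` is hypothetical and not part of any package.

```r
# Packages loaded in Step 1 (janitor included, since clean_names() is used later)
pkgs <- c("tidyverse", "janitor", "gt", "gtExtras", "ggtext", "scales", "glue")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}

# Hypothetical helper: is `pkg` installed at version `min` or newer?
check_min_version <- function(pkg, min) {
  requireNamespace(pkg, quietly = TRUE) && utils::packageVersion(pkg) >= min
}

check_min_version("dplyr", "1.2.0")
check_min_version("stringr", "1.6.0")
```

If either check returns `FALSE`, the chunks that call `filter_out()`, `recode_values()`, or the new case conversion functions are likely to fail until the package is updated.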
### Step 2: Configure Visualization Theme
```{r}
#| label: setup-theme
#| code-summary: "Configure default ggplot2 theme"
theme_set(
theme_minimal(base_size = 13, base_family = "sans") +
theme(
plot.title = element_markdown(size = 16, face = "bold", color = "#06b6d4", margin = margin(b = 10)),
plot.subtitle = element_markdown(size = 12, color = "#666666", margin = margin(b = 15)),
plot.caption = element_markdown(size = 9, color = "#999999", hjust = 0),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "#e5e5e5", linewidth = 0.3),
legend.position = "top",
legend.title = element_text(face = "bold", size = 11),
axis.title = element_text(face = "bold", size = 11)
)
)
cat("🎨 Custom theme configured!\n")
```
---
## Part 2: Loading and Exploring Data {#sec-load}
### Step 3: Load Census Data from URL
```{r}
#| label: load-data
#| code-summary: "Load census data from GitHub"
url <- "https://raw.githubusercontent.com/tongakuot/r_tutorials/main/06-data-wrangling/00-input/ss_2008_census_data_raw.csv"
census_raw <- read_csv(url, show_col_types = FALSE)
cat("✅ Data loaded successfully!\n")
cat("📋 Dimensions:", nrow(census_raw), "rows ×", ncol(census_raw), "columns\n")
```
### Step 4: Examine the Data Structure
```{r}
#| label: examine-structure
#| code-summary: "View data structure with glimpse()"
glimpse(census_raw)
```
### Step 5: Preview the Data
```{r}
#| label: preview-data
#| code-summary: "View first 10 rows"
# head() shows the first n rows
census_raw |>
head(10)
```
---
## Part 3: Data Cleaning {#sec-clean}
Raw data is rarely analysis-ready. We need to clean and standardize it first!
::: {.callout-note}
## Introducing stringr for Text Cleaning
The **stringr** package (part of tidyverse) provides consistent, intuitive functions for string manipulation. In this section, we'll use several key functions:
| Function | Purpose | Example |
|----------|---------|---------|
| `str_to_lower()` | Convert to lowercase | "HELLO" → "hello" |
| `str_to_upper()` | Convert to UPPERCASE | "hello" → "HELLO" |
| `str_to_title()` | Convert to Title Case | "hello world" → "Hello World" |
| `str_to_sentence()` | Sentence case | "hello world" → "Hello world" |
| `str_to_camel()` | Convert to camelCase | "quick brown fox" → "quickBrownFox" |
| `str_to_snake()` | Convert to snake_case | "Quick Brown Fox" → "quick_brown_fox" |
| `str_to_kebab()` | Convert to kebab-case | "Quick Brown Fox" → "quick-brown-fox" |
| `str_squish()` | Remove extra whitespace | " hello world " → "hello world" |
| `str_replace_all()` | Replace patterns | "a b c" → "a_b_c" |
The first seven are case conversion functions. The trio `str_to_camel()`, `str_to_snake()`, and `str_to_kebab()` are **new in stringr 1.6.0** and convert between programming naming conventions—essential when bridging R data with Python, JavaScript, or API outputs.
:::
#### stringr 1.6.0: Case Conversion in Action
Before stringr 1.6.0, converting between camelCase, snake_case, and kebab-case required manual regex or external packages like `snakecase`. Now these conversions are built in:
```{r}
#| label: stringr-case-demo
#| code-summary: "Demonstrate new stringr 1.6.0 case conversion functions"
#| eval: false
# New in stringr 1.6.0: convert between programming case conventions
demo_text <- "south sudan census data"
cat("Original: ", demo_text, "\n")
cat("camelCase: ", str_to_camel(demo_text), "\n")
cat("PascalCase: ", str_to_camel(demo_text, first_upper = TRUE), "\n")
cat("snake_case: ", str_to_snake(demo_text), "\n")
cat("kebab-case: ", str_to_kebab(demo_text), "\n")
cat("Title Case: ", str_to_title(demo_text), "\n")
```
::: {.callout-important}
## Transition: Old → New Case Conversion
**Before stringr 1.6.0** — manual regex or external package:
```r
# Required snakecase package or manual work
snakecase::to_snake_case("South Sudan Census")
gsub(" ", "-", tolower("South Sudan Census"))
```
**After stringr 1.6.0** — native, consistent, pipe-friendly:
```r
# Built into stringr — works seamlessly in tidyverse pipelines
"South Sudan Census" |> str_to_snake() # "south_sudan_census"
"South Sudan Census" |> str_to_kebab() # "south-sudan-census"
"South Sudan Census" |> str_to_camel() # "southSudanCensus"
```
These are particularly valuable when bridging R data with Python (`snake_case`), JavaScript (`camelCase`), or URL slugs (`kebab-case`).
:::
### Step 6: Clean and Transform the Dataset
Here's a practical cleaning pipeline using the `janitor` package for automatic name standardization:
```{r}
#| label: clean-data
#| code-summary: "Complete data cleaning pipeline"
#| code-line-numbers: true
census_clean <-
census_raw |>
# Standardize column names automatically
clean_names() |>
# Rename to meaningful names
select(
state = region_name,
gender = variable_name,
age_category = age_name,
population = x2008
) |>
# Clean text columns and convert population to integer
mutate(
across(where(is.character), \(x) str_squish(x) |> str_to_title()),
population = as.integer(population)
) |>
# Remove rows with missing or invalid data
filter(!is.na(population), population > 0)
# Display results
cat("✅ Data cleaning complete!\n")
cat("📊 Cleaned dataset:", nrow(census_clean), "rows ×", ncol(census_clean), "columns\n")
```
::: {.callout-tip icon="true"}
## The Cleaning Pipeline Explained
**Key steps:**
- **`clean_names()`**: Automatically standardizes column names (lowercase, underscores, safe for R)
- **`select()`**: Chooses columns and renames them simultaneously
- **`across(where(is.character), ...)`**: Applies the same transformation to all text columns at once
- **`str_squish()` + `str_to_title()`**: Removes extra spaces and capitalizes properly
- **`as.integer()`**: Converts population to whole numbers (appropriate for count data)
- **`filter()`**: Removes rows with missing or invalid population values
This single pipeline combines efficiency with clarity!
:::
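To see what each cleaning step does in isolation, the same pattern can be tried on a few rows. This is a small sketch on hypothetical toy data (the values below are invented for illustration and do not come from the census file).

```r
library(dplyr)
library(stringr)

# Hypothetical toy data with messy spacing, mixed case, and invalid counts
toy_raw <- tibble(
  state = c("  upper   nile ", "JONGLEI", "warrap"),
  gender = c(
    "population - male (number)",
    "Population - FEMALE (Number)",
    "population - total (number)"
  ),
  population = c(1200, NA, 0)
)

toy_clean <- toy_raw |>
  mutate(
    # Applies to every character column: state and gender here
    across(where(is.character), \(x) str_squish(x) |> str_to_title()),
    population = as.integer(population)
  ) |>
  # Drops the NA row and the zero row
  filter(!is.na(population), population > 0)

toy_clean
```

Only the first row survives the filter, and its `state` value becomes "Upper Nile". Note that `str_to_title()` also changes the case of words you might prefer to leave alone, so it is worth checking the distinct values afterwards.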
### Step 7: Preview the Cleaned Data
```{r}
#| label: preview-cleaned
#| code-summary: "View cleaned data"
census_clean |>
head(10)
```
---
## Part 3B: Advanced String Processing {#sec-string-processing}
Now we need to extract more meaningful information from our data. The `gender` column actually contains structured text like "Population - Male (Number)" that we can parse!
### Step 8: Examine the Gender Column Structure
```{r}
#| label: examine-gender
#| code-summary: "Explore gender column values"
# See unique values in gender column
cat("🔍 Unique values in gender column:\n")
census_clean |>
distinct(gender) |>
pull(gender)
```
The gender column follows a pattern: "Population - Gender (Type)" where we need to extract just the gender value.
### Step 9: Extract Gender Information
Three approaches follow: a simple split, a multi-column split, and a regular expression. Method 1 is the one applied to the dataset; Methods 2 and 3 are shown for comparison.
#### Method 1: Using str_split_i() — Simple and Direct
Split the text and extract the piece you need:
```{r}
#| label: gender-method-1
#| code-summary: "Method 1: Extract with str_split_i()"
# Apply gender extraction to our dataset
census_parsed <- census_clean |>
mutate(
# Split on " - ", extract 2nd piece, then remove "(Number)" text
gender = str_split_i(gender, " - ", 2) |>
str_remove(" \\(.*\\)") |>
str_squish()
)
# Verify extraction worked
cat("✅ Gender extraction complete!\n")
cat("🎯 Unique gender values:\n")
census_parsed |>
distinct(gender) |>
pull(gender)
```
::: {.callout-tip}
## How This Works
- **`str_split_i(gender, " - ", 2)`**: Split on the " - " delimiter, extract the 2nd piece
- "Population - Male (Number)" → "Male (Number)"
- **`str_remove(" \\(.*\\)")`**: Remove the " (anything)" pattern
- "Male (Number)" → "Male"
- **`str_squish()`**: Trim any leftover whitespace (a safety net; no change here)
- "Male" → "Male"
When you need only one piece of a delimited string, this is the most direct approach.
:::
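The steps can be run one at a time on a single string to confirm what each produces. This sketch assumes labels of the form "Population - Male (Number)", as described in this section, and splits on the " - " delimiter.

```r
library(stringr)

# Assumed label format, as described in this section
label <- "Population - Male (Number)"

# Step 1: split on the " - " delimiter and keep the 2nd piece
piece <- str_split_i(label, " - ", 2)
piece
# "Male (Number)"

# Step 2: drop the trailing parenthetical, including the space before it
gender <- str_remove(piece, " \\(.*\\)")
gender
# "Male"

# A negative index counts from the end: -1 is the last piece
str_split_i(label, " - ", -1)
```

Splitting on the full " - " delimiter rather than on a single space matters: a space-only split would return the dash itself as the second piece.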
#### Method 2: Using separate_wider_delim() — When You Need Multiple Pieces
If you need to keep multiple pieces from a split, use `separate_wider_delim()`:
```{r}
#| label: gender-method-2-separate
#| code-summary: "Method 2: separate_wider_delim() for multiple columns"
# Demonstrate separate_wider_delim()
demo_separate <- census_clean |>
select(original_gender = gender) |>
separate_wider_delim(
cols = original_gender,
delim = " - ",
names = c("prefix", "gender_raw", "suffix"),
too_few = "align_start",
cols_remove = FALSE
) |>
mutate(gender_clean = str_remove(gender_raw, " \\(.*\\)")) |>
select(original_gender, prefix, gender_raw, gender_clean) |>
distinct() |>
head(3)
cat("✅ separate_wider_delim() keeps all pieces:\n")
demo_separate
```
::: {.callout-note}
## When to Use Each Method
**`str_split_i()` (Method 1):**
- You only need one piece
- More concise code
- Recommended for this task
**`separate_wider_delim()` (Method 2):**
- You need multiple pieces as separate columns
- Better for data reshaping workflows
- Excellent when working with structured delimited text
**For our case:** We use Method 1 because we only extract the gender value.
:::
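As a compact illustration of the "multiple pieces" case, the sketch below keeps both the measure name and the gender, and also pulls the unit out of the parentheses. The toy labels and the column names `measure`, `gender_raw`, and `unit` are assumptions made for this example.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical labels in the format described above
toy_labels <- tibble(
  label = c("Population - Male (Number)", "Population - Female (Number)")
)

toy_wide <- toy_labels |>
  # One input column becomes two: text before and after " - "
  separate_wider_delim(
    cols = label,
    delim = " - ",
    names = c("measure", "gender_raw")
  ) |>
  mutate(
    gender = str_remove(gender_raw, " \\(.*\\)"),
    # Text between the parentheses
    unit = str_extract(gender_raw, "(?<=\\().*(?=\\))")
  )

toy_wide
```

With exactly two names and two pieces per label, no `too_few` or `too_many` handling is needed; those arguments become relevant only when rows split into a different number of pieces.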
#### Method 3: Regular Expressions — For Complex Patterns
For more complex text patterns, use regex with `str_extract()`:
```{r}
#| label: gender-method-3-regex
#| code-summary: "Method 3: Regex pattern matching"
# Regex approach (for reference/learning)
demo_regex <- census_clean |>
select(original_gender = gender) |>
mutate(
# Regex: text between " - " and " ("
gender_regex = str_extract(original_gender, "(?<= - ).*(?= \\()")
) |>
distinct() |>
head(3)
cat("✅ Regex extraction with lookaround assertions:\n")
demo_regex
```
::: {.callout-note}
## Regex Lookaround Basics
`"(?<= - ).*(?= \\()"` extracts text between two patterns:
- `(?<= - )`: lookbehind, the match must be preceded by space-dash-space
- `.*`: the match itself, any run of characters
- `(?= \\()`: lookahead, the match must be followed by " ("
**Result:** Extracts "Male" from "Population - Male (Number)"
Additional approaches (regex lookahead, separate + unite, etc.) are covered in the [stringr documentation](https://stringr.tidyverse.org/).
:::
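Lookarounds are one way to take text between two markers; a capture group is another. The sketch below, on hypothetical labels in the format described above, runs both and checks that they agree.

```r
library(stringr)

# Hypothetical labels in the assumed format
labels <- c(
  "Population - Male (Number)",
  "Population - Female (Number)",
  "Population - Total (Number)"
)

# Lookaround: match only the text between " - " and " ("
via_lookaround <- str_extract(labels, "(?<= - ).*(?= \\()")

# Capture group: match the whole frame, keep column 2 (the first group)
via_capture <- str_match(labels, " - (.*) \\(")[, 2]

via_lookaround
identical(via_lookaround, via_capture)
```

The capture-group form is often easier to read once a pattern has several parts, since `str_match()` returns one column per group.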
### Step 10: Recategorize Age Groups
Let's group the fine-grained 5-year age bands into broader, more interpretable categories:
```{r}
#| label: examine-age-categories
#| code-summary: "View current age categories"
cat("🔍 Current age categories:\n")
census_parsed |>
distinct(age_category) |>
arrange(age_category) |>
pull(age_category)
```
#### Method 1: Using case_when() for Conditional Recategorization
The `case_when()` function is perfect for complex, multi-condition transformations:
```{r}
#| label: age-method-1
#| code-summary: "Method 1: case_when() for age recategorization"
#| code-line-numbers: true
census_final <- census_parsed |>
mutate(
age_category = str_to_lower(age_category),
age_category = case_when(
# Children (0-14)
age_category %in% c("0 to 4", "5 to 9", "10 to 14") ~ "0-14",
# Youth (15-24)
age_category %in% c("15 to 19", "20 to 24") ~ "15-24",
# Early working age (25-34)
age_category %in% c("25 to 29", "30 to 34") ~ "25-34",
# Middle working age (35-44)
age_category %in% c("35 to 39", "40 to 44") ~ "35-44",
# Later working age (45-54)
age_category %in% c("45 to 49", "50 to 54") ~ "45-54",
# Pre-retirement (55-64)
age_category %in% c("55 to 59", "60 to 64") ~ "55-64",
# Retirement age (65+)
age_category == "65+" ~ "65+",
# Catch any unexpected values
TRUE ~ age_category
)
)
# Verify recategorization
cat("✅ Age categories recategorized!\n")
cat("🎯 New age categories:\n")
census_final |>
distinct(age_category) |>
arrange(age_category) |>
pull(age_category)
```
::: {.callout-important icon="true"}
## Understanding case_when()
`case_when()` is like a multi-way IF statement (similar to SQL's CASE WHEN):
**Structure:**
```r
case_when(
condition1 ~ result1, # If condition1 is TRUE, return result1
condition2 ~ result2, # Else if condition2 is TRUE, return result2
condition3 ~ result3, # Else if condition3 is TRUE, return result3
TRUE ~ default # Else return default (catch-all)
)
```
**Key points:**
- Conditions are evaluated **in order** (first match wins)
- `%in%` checks if value is in a vector (like "is one of")
- `~` separates condition from result
- `TRUE ~` at the end catches anything not matched above
**Example for our data:**
- If age is "0 to 4" OR "5 to 9" OR "10 to 14" → return "0-14"
- Else if age is "15 to 19" OR "20 to 24" → return "15-24"
- And so on...
:::
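The "first match wins" rule is easiest to see with overlapping conditions. In this sketch on hypothetical numeric ages, the value 3 satisfies all three comparisons but receives the label from the first one. It uses `.default`, which is the newer spelling of the `TRUE ~` catch-all (available since dplyr 1.1.0).

```r
library(dplyr)

# Hypothetical ages, including a missing value
age <- c(3, 17, 40, 70, NA)

age_group <- case_when(
  is.na(age) ~ NA_character_,  # handle missing values explicitly first
  age < 15   ~ "0-14",
  age < 25   ~ "15-24",        # reached only when age >= 15
  age < 65   ~ "25-64",        # reached only when age >= 25
  .default = "65+"
)

age_group
```

Reordering the conditions changes the result: if `age < 65` came first, every age below 65 would be labelled "25-64".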
#### Method 2: Using recode_values() — New in dplyr 1.2.0
For value-to-value mappings, the new `recode_values()` is cleaner than `case_when()`. It accepts a `from`/`to` lookup table or formula syntax:
```{r}
#| label: age-method-2-recode-values
#| code-summary: "Method 2: recode_values() with lookup table (dplyr 1.2.0)"
#| eval: true
# Build a lookup table — clean and portable
age_lookup <- tribble(
~from, ~to,
"0 to 4", "0-14",
"5 to 9", "0-14",
"10 to 14", "0-14",
"15 to 19", "15-24",
"20 to 24", "15-24",
"25 to 29", "25-34",
"30 to 34", "25-34",
"35 to 39", "35-44",
"40 to 44", "35-44",
"45 to 49", "45-54",
"50 to 54", "45-54",
"55 to 59", "55-64",
"60 to 64", "55-64",
"65+", "65+"
)
# Demonstrate recode_values() (not applied to dataset)
demo_recode <- census_parsed |>
select(age_category) |>
mutate(
age_category = str_to_lower(age_category),
age_recoded = recode_values(
age_category,
from = age_lookup$from,
to = age_lookup$to
)
) |>
distinct()
cat("✅ recode_values() demonstration:\n")
demo_recode
```
::: {.callout-tip}
## Comparing case_when() vs recode_values()
**Use `case_when()` when:**
- Multiple conditions per category (AND/OR logic)
- Complex conditional expressions
- **Best when conditions aren't simple equality checks**
**Use `recode_values()` (dplyr 1.2.0) when:**
- Simple value-to-value mappings
- Lookup table stored externally (CSV, tribble)
- Cleaner syntax for 1-to-1 replacements
- **Replaces the superseded `case_match()` and `recode()`**
Both work for our use case. We use `case_when()` above because it groups related age bands together, making the logic visible. For production code with many mappings, `recode_values()` with a lookup table is more maintainable.
:::
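If you are on an older dplyr release, a lookup table can also be applied with a plain join. This is a sketch of that alternative, not the method used in this tutorial; the shortened lookup and the column name `age_group` are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Shortened, hypothetical lookup table
age_lookup <- tribble(
  ~age_category, ~age_group,
  "0 to 4",      "0-14",
  "5 to 9",      "0-14",
  "15 to 19",    "15-24",
  "65+",         "65+"
)

toy_ages <- tibble(age_category = c("0 to 4", "15 to 19", "65+", "total"))

toy_joined <- toy_ages |>
  left_join(age_lookup, by = "age_category")

# Values with no match in the lookup surface as NA, which makes gaps easy to audit
toy_joined |> filter(is.na(age_group))
```

The join keeps the lookup as ordinary data, so it can be read from a CSV and reviewed separately from the code.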
::: {.callout-important}
## Transition: Old → New Recoding Functions
**Superseded — `recode()` (avoid in new code):**
```r
recode(x, "0 to 4" = "0-14", "5 to 9" = "0-14", .default = x)
```
**Soft-deprecated — `case_match()` (migrate to `recode_values()`):**
```r
case_match(x, c("0 to 4", "5 to 9", "10 to 14") ~ "0-14", .default = x)
```
**New — `recode_values()` with formula syntax:**
```r
recode_values(x, "0 to 4" ~ "0-14", "5 to 9" ~ "0-14", default = x)
```
**New — `recode_values()` with lookup table (recommended for many mappings):**
```r
recode_values(x, from = age_lookup$from, to = age_lookup$to)
```
**New — `replace_values()` for partial updates (preserves column type):**
```r
replace_values(x, "Total" ~ "All Genders")
```
The `recode_values()` function replaces both `recode()` and `case_match()`. Use `replace_values()` when you only need to change a few values while keeping the rest intact.
:::
### Step 11: Filter and Verify Final Dataset
Remove aggregate rows before analysis. In **dplyr 1.2.0**, the new `filter_out()` makes this intent explicit—you specify what to *drop* rather than negate conditions:
```{r}
#| label: filter-out-demo
#| code-summary: "Drop aggregate rows before analysis"
# Remove rows where gender is "Total" or age_category is "total"
# (age_category was lowercased in Step 10, gender was not)
census_filtered <- census_final |>
filter_out(gender == "Total" | age_category == "total")
census_filtered
```
::: {.callout-important}
## Transition: Old → New Row Filtering
**Before dplyr 1.2.0 — negated logic with `filter()`:**
```r
# Negated logic: "keep rows where gender is NOT Total AND age is NOT total"
census_final |> filter(gender != "Total", age_category != "total")
```
**After dplyr 1.2.0 — direct intent with `filter_out()`:**
```r
# Clear intent — "drop rows where gender IS Total OR age IS Total"
census_final |> filter_out(gender == "Total" | age_category == "total")
```
**Why this matters:**
- `filter()` keeps rows that match → forces negated logic to drop rows (`!=`, `!`)
- `filter_out()` drops rows that match → write positive conditions, cleaner boolean logic
- `filter_out()` handles `NA` values more predictably (rows with `NA` are kept, not silently dropped)
**Bonus — `when_any()` and `when_all()` for multi-column conditions:**
```r
# Drop rows where ANY condition is TRUE (elementwise OR across several conditions)
census_final |> filter_out(when_any(gender == "Total", age_category == "total"))
```
These helpers compose naturally with both `filter()` and `filter_out()`, making multi-column conditions readable.
:::
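The `NA` behavior of negated `filter()` conditions is worth seeing directly. The sketch below uses only `filter()` on hypothetical toy data, so it runs on any recent dplyr version.

```r
library(dplyr)

# Hypothetical toy data with one missing gender
toy <- tibble(gender = c("Male", "Female", "Total", NA))

# NA != "Total" evaluates to NA, and filter() keeps only TRUE rows,
# so the NA row disappears along with "Total"
dropped_na <- toy |> filter(gender != "Total")
nrow(dropped_na)
# 2

# Keeping the NA row with filter() needs an explicit is.na() clause
kept_na <- toy |> filter(is.na(gender) | gender != "Total")
nrow(kept_na)
# 3
```

This is the pattern that a drop-oriented verb is meant to make unnecessary: the second call is what you usually intend when you say "remove the Total rows".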
Let's confirm all transformations worked correctly:
```{r}
#| label: verify-final
#| code-summary: "Verify final cleaned dataset"
# Generate a comprehensive summary of the cleaned and transformed dataset
# This verification step ensures all transformations were applied correctly
cat(strrep("=", 50), "\n", sep = "")
cat("🎉 DATA TRANSFORMATION COMPLETE!\n")
cat(strrep("=", 50), "\n", sep = "")
# Display dimensions of the final cleaned dataset
cat("📊 Final dataset dimensions:\n")
cat(" Rows:", nrow(census_filtered), "\n")
cat(" Columns:", ncol(census_filtered), "\n\n")
# List all column names in the cleaned dataset
cat("✅ Column names:\n")
cat(" ", paste(names(census_filtered), collapse = ", "), "\n\n")
# Show all unique gender categories (only "Male" and "Female" should remain after filtering)
cat("🎯 Unique gender values:\n")
census_filtered |> distinct(gender) |> pull(gender) |> cat(" ", "\n")
# Show all unique age categories in sorted order
cat("\n🎯 Unique age categories:\n")
census_filtered |> distinct(age_category) |> arrange(age_category) |> pull(age_category) |> cat(" ", "\n")
# Display first 10 rows of key columns to visually inspect the data
cat("\n📋 Sample of final data:\n")
census_filtered |>
select(state, gender, age_category, population) |>
head(10)
```
::: {.callout-note}
## Data Transformation Summary
**What we accomplished:**
**Original `gender` column values:**
- "Population - Male (Number)"
- "Population - Female (Number)"
- "Population - Total (Number)"
**Transformed to clean format:**
- "Male"
- "Female"
- "Total"
**Original `age_category` column structure:**
- 14 individual age bands (thirteen 5-year bands plus an open-ended "65+")
- Examples: "0 To 4", "5 To 9", "10 To 14", ..., "65+"
**Transformed to standardized life-stage categories:**
- 7 broader, more interpretable age groups
- "0-14", "15-24", "25-34", "35-44", "45-54", "55-64", "65+"
**Result:** A clean, standardized dataset ready for analysis and visualization! ✨
:::
---
## Part 4: Data Exploration and Summary {#sec-explore}
### Step 12: Create Overview Statistics
Let's calculate some key statistics about our dataset:
```{r}
#| label: overview-stats
#| code-summary: "Calculate summary statistics"
# Create a summary table
overview_table <- census_filtered |>
summarise(
`Total Population` = comma(sum(population)), # Format with commas
`Number of States` = n_distinct(state), # Count unique states
`Age Categories` = n_distinct(age_category), # Count unique ages
`Gender Groups` = n_distinct(gender), # Count unique genders
`Total Observations` = comma(n()) # Count all rows
)
# Display the summary
overview_table
```
::: {.callout-note}
## Understanding summarise()
`summarise()` collapses data into summary statistics:
- `sum()` - adds up values
- `n_distinct()` - counts unique values
- `n()` - counts total rows
- `comma()` - formats numbers with commas (from scales package)
It reduces many rows into one row of summaries!
:::
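A three-row example makes the collapse to a single row concrete. The data below is hypothetical and chosen so the results are easy to verify by hand.

```r
library(dplyr)

# Hypothetical toy data: two states, three rows
toy <- tibble(
  state = c("A", "A", "B"),
  gender = c("Male", "Female", "Male"),
  population = c(10L, 12L, 7L)
)

toy_summary <- toy |>
  summarise(
    total_population = sum(population),  # 10 + 12 + 7
    n_states = n_distinct(state),        # "A" and "B"
    n_rows = n()                         # rows in the input
  )

toy_summary
```

Keeping `total_population` numeric here, rather than wrapping it in `comma()`, leaves it available for further arithmetic; formatting can be applied at display time.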
### Step 13: Display as Professional Table
Now let's make this summary look professional using the `gt` package:
```{r}
#| label: overview-table-styled
#| code-summary: "Create styled table with gt"
overview_table |>
gt() |>
tab_header(
title = md("**South Sudan 2008 Census Overview**"),
subtitle = "Key Summary Statistics"
) |>
tab_style(
style = cell_fill(color = "#22d3ee"),
locations = cells_body()
) |>
tab_style(
style = cell_text(color = "white", weight = "bold"),
locations = cells_body()
) |>
tab_options(
table.font.size = px(13),
heading.title.font.size = px(16),
heading.subtitle.font.size = px(12)
)
```
::: {.callout-tip icon="true"}
## The Grammar of Tables (gt)
The `gt` package uses a **layered approach** (like ggplot2 for tables):
1. **Start with data** → `gt()`
2. **Add headers** → `tab_header()`
3. **Style cells** → `tab_style()`
4. **Format numbers** → `fmt_number()`
5. **Adjust options** → `tab_options()`
Each layer adds or modifies the table appearance!
:::
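The list above mentions `fmt_number()`, which formats values inside the table instead of converting them to text beforehand. Here is a minimal sketch on hypothetical numbers showing the layers in order.

```r
library(gt)

# Hypothetical values, kept numeric
toy_summary <- data.frame(
  metric = c("Population", "Observations"),
  value = c(1234567, 890)
)

toy_table <- toy_summary |>
  gt() |>                                        # 1. start with data
  tab_header(title = "Toy Summary") |>           # 2. add a header
  fmt_number(                                    # 4. format numbers
    columns = value,
    decimals = 0,
    use_seps = TRUE
  ) |>
  tab_options(table.font.size = px(13))          # 5. adjust options

toy_table
```

Formatting in `gt` rather than with `comma()` upstream means the underlying column stays numeric, which is a design choice worth considering when the same summary feeds both a table and a chart.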
---
## Part 5: Gender Analysis {#sec-gender}
### Step 14: Calculate National Gender Distribution
Let's analyze how the population is distributed by gender:
```{r}
#| label: gender-summary
#| code-summary: "Calculate gender distribution"
#| code-line-numbers: true
gender_summary <- census_filtered |>
# Step 1: Group data by gender
group_by(gender) |>
# Step 2: Calculate total population for each gender
summarise(
population = sum(population),
.groups = "drop" # Remove grouping after summarise
) |>
# Step 3: Calculate percentages
mutate(
percentage = population / sum(population) * 100,
percentage_label = percent(percentage / 100, accuracy = 0.01)
) |>
# Step 4: Sort by population (largest first)
arrange(desc(population))
# Display the results
gender_summary
```
::: {.callout-important}
## Understanding group_by() and summarise()
These two functions work together like a team:
**`group_by(gender)`**
- Splits data into groups (one for Male, one for Female)
- Like separating cards into piles
**`summarise(population = sum(population))`**
- Performs calculations within each group
- `sum()` adds up all population values in each group
- Like counting cards in each pile
**`.groups = "drop"`**
- Removes the grouping after we're done
- Prevents unexpected behavior in future operations
**Final result:** One row per gender with total population!
:::
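Since dplyr 1.1.0 there is also a per-operation alternative: the `.by` argument groups for a single `summarise()` call and always returns an ungrouped result. The sketch below uses hypothetical toy data and is offered as an alternative spelling, not as the approach this tutorial uses.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  state = c("A", "A", "B", "B"),
  gender = c("Male", "Female", "Male", "Female"),
  population = c(10, 12, 7, 11)
)

toy_gender <- toy |>
  # Equivalent to group_by(gender) |> summarise(..., .groups = "drop")
  summarise(population = sum(population), .by = gender) |>
  mutate(percentage = population / sum(population) * 100)

toy_gender
```

Because the result is never grouped, the `sum(population)` inside `mutate()` is the grand total, which is what the percentage calculation needs.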
### Step 15: Display Gender Table
```{r}
#| label: gender-table
#| code-summary: "Create styled gender distribution table"
gender_summary |>
# Rename columns for display
select(
Gender = gender,
Population = population,
`Percentage` = percentage_label
) |>
# Create gt table
gt() |>
# Add title and subtitle
tab_header(
title = md("**National Gender Distribution**"),
subtitle = "South Sudan 2008 Census"
) |>
# Format population with commas
fmt_number(
columns = Population,
decimals = 0,
use_seps = TRUE
) |>
# Style the Male row by value rather than by position - cyan background
tab_style(
style = list(
cell_fill(color = "#22d3ee"),
cell_text(color = "white", weight = "bold")
),
locations = cells_body(rows = Gender == "Male")
) |>
# Style the Female row - gold background
tab_style(
style = list(
cell_fill(color = "#FFD700"),
cell_text(color = "#000000", weight = "bold")
),
locations = cells_body(rows = Gender == "Female")
) |>
# Center all columns
cols_align(align = "center", columns = everything()) |>
# Adjust font sizes
tab_options(
table.font.size = px(13),
heading.title.font.size = px(16)
)
```
### Step 16: Visualize Gender Distribution
Numbers are great, but visualizations make patterns instantly clear. Let's create a pie chart:
```{r}
#| label: gender-viz
#| code-summary: "Create pie chart for gender distribution"
#| fig-width: 10
#| fig-height: 6
#| fig-cap: "Gender distribution shown as a pie chart with population counts and percentages"
ggplot(gender_summary, aes(x = "", y = population, fill = gender)) +
# Create a bar chart (we'll turn it into a pie)
geom_col(width = 1, color = "white", linewidth = 2) +
# Convert bar chart to pie chart using polar coordinates
coord_polar(theta = "y") +
# Set custom colors for Male and Female
scale_fill_manual(
values = c(
"Male" = "#22d3ee",
"Female" = "#FFD700"
)
) +
# Add labels showing counts and percentages
geom_text(
aes(label = glue("{comma(population)}\n({percentage_label})")),
position = position_stack(vjust = 0.5), # Center in each slice
size = 5,
fontface = "bold",
color = "white"
) +
# Add titles and labels
labs(
title = "**Gender Distribution in South Sudan**",
subtitle = "2008 Census Data",
fill = "Gender"
) +
# Use void theme for pie charts (removes axes)
theme_void() +
# Customize title and legend
theme(
plot.title = element_markdown(
size = 16,
face = "bold",
color = "#06b6d4",
hjust = 0.5, # Center title
margin = margin(b = 5)
),
plot.subtitle = element_markdown(
size = 12,
color = "#666666",
hjust = 0.5 # Center subtitle
),
legend.position = "bottom",
legend.title = element_text(face = "bold", size = 11)
)
```
::: {.callout-tip}
## Anatomy of a ggplot2 Chart
Every ggplot2 visualization follows this pattern:
1. **Start with data** → `ggplot(data, aes(...))`
2. **Add geometry** → `geom_*()` (point, line, bar, etc.)
3. **Adjust scales** → `scale_*()` (colors, axes, etc.)
4. **Add labels** → `labs()` (title, axes, etc.)
5. **Apply theme** → `theme_*()` (appearance)
Think of it like building with LEGO blocks—each layer adds something!
**Bonus:** `coord_polar()` transforms rectangular plots into circular ones (bar chart → pie chart)!
:::
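The five layers can be seen in their simplest form with a plain bar chart, which is also a common alternative to a pie chart when comparing two or more categories. This sketch uses hypothetical counts and reuses the tutorial's two colors.

```r
library(ggplot2)

# Hypothetical counts
toy <- data.frame(
  gender = c("Male", "Female"),
  population = c(17, 23)
)

p <- ggplot(toy, aes(x = gender, y = population, fill = gender)) +  # 1. data
  geom_col(width = 0.6) +                                           # 2. geometry
  scale_fill_manual(                                                # 3. scales
    values = c(Male = "#22d3ee", Female = "#FFD700")
  ) +
  labs(                                                             # 4. labels
    title = "Toy Gender Counts",
    x = NULL,
    y = "Population",
    fill = "Gender"
  ) +
  theme_minimal()                                                   # 5. theme

p
```

Adding `coord_polar(theta = "y")` to a stacked version of this chart is what turns it into a pie, which is why the pie chart code starts from `geom_col()`.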
### Step 17: Gender Distribution by State
Now let's see how gender distribution varies across different states:
```{r}
#| label: state-gender-analysis
#| code-summary: "Calculate state-level gender statistics"
state_gender <- census_filtered |>
# Group by both state AND gender
group_by(state, gender) |>
# Sum population within each state-gender combination
summarise(population = sum(population), .groups = "drop") |>
# Reshape from long to wide format
# Before: Multiple rows per state (one for Male, one for Female)
# After: One row per state (Male and Female as separate columns)
pivot_wider(names_from = gender, values_from = population) |>
# Calculate additional metrics
mutate(
total = Male + Female, # Total population
male_pct = Male / total * 100, # Male percentage
female_pct = Female / total * 100, # Female percentage
gender_ratio = Male / Female * 100 # Males per 100 females
) |>
# Sort by total population (largest first)
arrange(desc(total))
# Display top 5 states
state_gender |>
head(5)
```
::: {.callout-note}
## Understanding pivot_wider()
`pivot_wider()` reshapes data from **long** to **wide** format:
**Before (Long format):**
```
State Gender Population
Jonglei Male 734327
Jonglei Female 624275
Warrap Male 470734
Warrap Female 502194
```
**After (Wide format):**
```
State Male Female Total
Jonglei 734327 624275 1358602
Warrap 470734 502194 972928
```
Why? Because it's easier to calculate ratios and percentages when Male and Female are in separate columns!
:::
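The reshape and the ratio calculation can be checked by hand on a tiny example. The states and counts below are hypothetical, chosen so the gender ratios come out to round numbers.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per state-gender pair
toy_long <- tibble(
  state = c("A", "A", "B", "B"),
  gender = c("Male", "Female", "Male", "Female"),
  population = c(105, 100, 95, 100)
)

toy_wide <- toy_long |>
  # Male and Female become columns, one row per state
  pivot_wider(names_from = gender, values_from = population) |>
  mutate(
    total = Male + Female,
    gender_ratio = Male / Female * 100  # males per 100 females
  )

toy_wide
```

State A comes out at 105 males per 100 females and state B at 95, matching the interpretation of the ratio as "males per 100 females".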
### Step 18: Display State Gender Table
```{r}
#| label: state-gender-table
#| code-summary: "Create styled state gender table"
state_gender |>
  head(5) |>
  # Select and rename columns for display
  select(
    State = state,
    Male,
    Female,
    Total = total,
    `Male %` = male_pct,
    `Female %` = female_pct,
    `Gender Ratio` = gender_ratio
  ) |>
  # Create table (State becomes the row-label stub)
  gt(rowname_col = "State") |>
  cols_align(columns = State, align = "right") |>
  # Add header
  tab_header(
    title = md("**Gender Distribution by State**"),
    subtitle = "Top 5 Most Populous States"
  ) |>
  # Format population columns with commas
  fmt_number(
    columns = c(Male, Female, Total),
    decimals = 0,
    use_seps = TRUE
  ) |>
  # Format percentage and ratio columns
  fmt_number(
    columns = c(`Male %`, `Female %`, `Gender Ratio`),
    decimals = 2
  ) |>
  # Add color gradient to Gender Ratio
  # The domain is centered on 100, so balanced ratios map to white
  # Values far from 100 show imbalance (colored)
  data_color(
    columns = `Gender Ratio`,
    palette = c("#FFD700", "#ffffff", "#22d3ee"),
    domain = c(80, 120)
  ) |>
  # Highlight the State labels (State is the stub, so target cells_stub())
  tab_style(
    style = cell_fill(color = "#f8f9fa"),
    locations = cells_stub()
  ) |>
  # Add footnote explaining Gender Ratio
  tab_footnote(
    footnote = "Gender Ratio represents males per 100 females. Values near 100 indicate balance.",
    locations = cells_column_labels(columns = `Gender Ratio`)
  ) |>
  # Apply pre-built theme
  gt_theme_538(quiet = TRUE) |>
  # Adjust font sizes
  tab_options(
    table.font.size = px(12),
    heading.title.font.size = px(16),
    footnotes.font.size = px(10)
  )
```
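With a three-color palette, `data_color()` places the middle color at the midpoint of `domain`. If white should sit exactly at a ratio of 100, the domain therefore needs 100 as its center. Here is a minimal sketch of one way to derive such a domain from the data. The helper `centered_domain()` and the vector `ratios` are hypothetical names made up for this example and are not part of gt.

```{r}
# Hypothetical helper: build a color domain that is symmetric around `center`
centered_domain <- function(x, center = 100) {
  # Largest distance of any value from the center, rounded up to a whole number
  half_width <- ceiling(max(abs(x - center), na.rm = TRUE))
  c(center - half_width, center + half_width)
}

# Illustrative ratios only, not census results
ratios <- c(117.6, 93.7, 105.2)
centered_domain(ratios)
```

In the table code, the result could be supplied as `domain = centered_domain(...)` using the `gender_ratio` values of the rows being displayed.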
::: {.callout-important}
## Understanding Gender Ratio
**Gender Ratio** = (Males / Females) × 100
- **Ratio = 100**: Perfect balance (equal males and females)
- **Ratio > 100**: More males than females
- **Ratio < 100**: More females than males
For example:
- Ratio of 105 means 105 males per 100 females (5% more males than females)
- Ratio of 95 means 95 males per 100 females (5% fewer males than females)
:::
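The interpretation rules above translate directly into code. The following sketch applies them to the Jonglei and Warrap figures from the `pivot_wider()` callout; the object name `ratio_demo` and the wording of the labels are choices made for this example only.

```{r}
library(dplyr)

ratio_demo <- tibble(
  state = c("Jonglei", "Warrap"),
  Male = c(734327, 470734),
  Female = c(624275, 502194)
) |>
  mutate(
    # Males per 100 females
    gender_ratio = Male / Female * 100,
    # Turn the number into a plain-language reading
    interpretation = case_when(
      gender_ratio > 100 ~ "More males than females",
      gender_ratio < 100 ~ "More females than males",
      TRUE ~ "Perfect balance"
    )
  )

ratio_demo
```

Jonglei comes out at roughly 118 males per 100 females and Warrap at roughly 94, so the two states fall on opposite sides of the balance point.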
### Step 19: Visualize State Gender Distribution
```{r}
#| label: state-gender-viz
#| code-summary: "Create grouped bar chart by state and gender"
#| fig-width: 12
#| fig-height: 8
#| fig-cap: "Population by state and gender for the top 5 most populous states"
state_gender |>
  head(5) |>
  # Convert from wide to long format for plotting
  # Need separate rows for Male and Female to create grouped bars
  pivot_longer(
    cols = c(Male, Female),
    names_to = "gender",
    values_to = "population"
  ) |>
  # Reorder states by total population for better visualization
  mutate(state = fct_reorder(state, total)) |>
  # Create plot
  ggplot(aes(x = state, y = population, fill = gender)) +
  # Grouped bar chart (bars side by side)
  geom_col(position = "dodge", alpha = 0.9, width = 0.7) +
  # Set colors
  scale_fill_manual(
    values = c(
      "Male" = "#22d3ee",
      "Female" = "#FFD700"
    )
  ) +
  # Format y-axis labels (show as "100K" instead of "100000")
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
  # Flip coordinates (horizontal bars are easier to read)
  coord_flip() +
  # Add labels
  labs(
    title = "**Population by State and Gender**",
    subtitle = "Top 5 Most Populous States | South Sudan 2008 Census",
    x = NULL, # Remove x-axis label (it says "state" which is obvious)
    y = "Population",
    fill = "Gender"
  ) +
  # Customize theme
  theme(
    panel.grid.major.y = element_blank(), # Remove horizontal grid lines
    panel.grid.major.x = element_line(color = "#e5e5e5"),
    legend.position = "top"
  )
```
::: {.callout-tip}
## Choosing the Right Chart Type
**Grouped Bar Chart** (what we used):
- Best for: Comparing categories across groups
- Shows: Exact values for each category
- Advantage: Easy to compare Male vs Female within each state
**Stacked Bar Chart** (alternative):
- Best for: Showing part-to-whole relationships
- Shows: Total and composition
- Advantage: Shows total population at a glance
**Why coord_flip()?**
Long state names are easier to read horizontally than at an angle!
:::
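To see the stacked alternative next to the grouped chart, only the `position` argument has to change. The sketch below uses a small hypothetical data frame (`toy_bars`, built from the two states in the `pivot_wider()` callout) rather than the full census data, and it leaves out the tutorial's colors and theme to keep the comparison short.

```{r}
library(ggplot2)

# Hypothetical toy data, reusing the two states from the pivot_wider() callout
toy_bars <- data.frame(
  state = c("Jonglei", "Jonglei", "Warrap", "Warrap"),
  gender = c("Male", "Female", "Male", "Female"),
  population = c(734327, 624275, 470734, 502194)
)

stacked_plot <- ggplot(toy_bars, aes(x = state, y = population, fill = gender)) +
  # "stack" is geom_col()'s default position; written out to contrast with "dodge"
  geom_col(position = "stack") +
  coord_flip() +
  labs(x = NULL, y = "Population", fill = "Gender")

stacked_plot
```

Each bar now ends at the state's total population, which makes totals easy to compare but makes the Male versus Female comparison within a state harder to judge.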
---
## Part 6: Age Category Analysis {#sec-age}
### Step 20: Calculate National Age Distribution
```{r}
#| label: age-summary
#| code-summary: "Calculate age category distribution"
age_summary <- census_filtered |>
  # Group by age category
  group_by(age_category) |>
  # Sum population for each age group
  summarise(
    population = sum(population),
    .groups = "drop"
  ) |>
  # Calculate percentages
  mutate(
    percentage = population / sum(population) * 100,
    percentage_label = percent(percentage / 100, accuracy = 0.01)
  ) |>
  # Sort by population (largest first)
  arrange(desc(population))

# Display results
age_summary
```
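The percentage step divides by 100 again before calling `percent()` because `scales::percent()` expects a proportion (such as 0.5) and multiplies by 100 itself. A small sketch with made-up counts shows the two columns side by side; `toy_counts`, `toy_pct`, and `toy_labels` are hypothetical names used only here.

```{r}
library(scales)

# Hypothetical counts for three age groups (illustrative, not census figures)
toy_counts <- c(500, 300, 200)

# Share of each group in the total, on a 0-100 scale
toy_pct <- toy_counts / sum(toy_counts) * 100

# percent() wants proportions, so convert back to a 0-1 scale first
toy_labels <- percent(toy_pct / 100, accuracy = 0.01)

sum(toy_pct)  # shares of a whole add up to 100
toy_labels
```

Keeping the numeric `percentage` column alongside the text label is useful: the number can still be sorted or plotted, while the label is only for display.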
### Step 21: Display Age Distribution Table
```{r}
#| label: age-table
#| code-summary: "Create styled age distribution table"
age_summary |>
  select(
    `Age Category` = age_category,
    Population = population,
    Percentage = percentage_label
  ) |>
  # Create table
  gt() |>
  # Add header
  tab_header(
    title = md("**Population Distribution by Age Category**"),
    subtitle = "National Summary"
  ) |>
  # Format population with commas
  fmt_number(
    columns = Population,
    decimals = 0,
    use_seps = TRUE
  ) |>
  # Add color gradient based on population size
  data_color(
    columns = Population,
    palette = c("#000000", "#0891b2", "#22d3ee", "#FFD700")
  ) |>
  # Make text white on colored backgrounds
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_body(columns = Population)
  ) |>
  # Add vertical divider between columns
  gt_add_divider(columns = `Age Category`, color = "#e5e5e5") |>
  # Adjust font sizes
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16)
  )
```
### Step 22: Visualize Age Distribution
```{r}
#| label: age-viz
#| code-summary: "Create horizontal bar chart for age distribution"
#| fig-width: 12
#| fig-height: 6
#| fig-cap: "Population distribution across age categories with exact counts labeled"
age_summary |>
  # Reorder age categories by population for better visual ranking
  mutate(age_category = fct_reorder(age_category, population)) |>
  # Create plot
  ggplot(aes(x = age_category, y = population, fill = population)) +
  # Bar chart
  geom_col(alpha = 0.9, show.legend = FALSE) +
  # Add text labels showing exact population
  geom_text(
    aes(label = comma(population)),
    hjust = -0.1, # Position slightly outside the bar
    size = 3.5,
    fontface = "bold",
    color = "#06b6d4"
  ) +
  # Color gradient from dark to light
  scale_fill_gradient(low = "#000000", high = "#FFD700") +
  # Format y-axis and add space for text labels
  scale_y_continuous(
    labels = label_number(scale = 1e-3, suffix = "K"),
    expand = expansion(mult = c(0, 0.15)) # Add 15% space on right for labels
  ) +
  # Horizontal bars
  coord_flip() +
  # Labels
  labs(
    title = "**Population Distribution by Age Category**",
    subtitle = "South Sudan 2008 Census | National Summary",
    x = NULL,
    y = "Population"
  ) +
  # Theme adjustments
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "#e5e5e5")
  )
```
::: {.callout-note}
## Understanding expansion() in scale_y_continuous()
**`expansion(mult = c(0, 0.15))`** controls the padding at each end of the y scale:
- First value (0): No extra space at the lower end of the scale
- Second value (0.15): Add 15% extra space at the upper end
Because `coord_flip()` turns the y scale horizontal, these two ends appear as the left and right edges of the plot.
Why? To make room for our text labels showing exact population counts!
Without this, the labels would get cut off at the edge of the plot.
:::
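The arithmetic behind the multiplicative expansion is simple enough to check by hand. The sketch below assumes a hypothetical scale running from 0 to 1000 (the names `data_range`, `mult`, and `padded` are illustrative) and computes where the padded limits end up.

```{r}
library(ggplot2)

# Hypothetical data range for a bar chart that starts at zero
data_range <- c(0, 1000)
mult <- c(0, 0.15)

# Multiplicative expansion pads each end by a fraction of the range width
range_width <- diff(data_range)
padded <- c(
  data_range[1] - mult[1] * range_width,
  data_range[2] + mult[2] * range_width
)
padded

# The same settings expressed with ggplot2's helper
expansion(mult = mult)
```

With these numbers the upper limit moves from 1000 to 1150, while the lower limit stays at 0 so the bars still sit flush against the axis.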
---
## dplyr 1.2.0 & stringr 1.6.0: Migration Quick Reference {#sec-migration}
This tutorial showcased several new functions introduced in **dplyr 1.2.0** and **stringr 1.6.0**. Here is a consolidated before/after reference for migrating your existing code:
| Task | Before | After (dplyr 1.2.0 / stringr 1.6.0) |
|------|--------|--------------|
| **Drop rows** | `filter(x != "Total")` | `filter_out(x == "Total")` |
| **Multi-column drop** | `filter(a != "X", b != "Y")` | `filter_out(when_any(a == "X", b == "Y"))` |
| **Recode all values** | `case_match(x, "a" ~ 1, "b" ~ 2)` | `recode_values(x, "a" ~ 1, "b" ~ 2)` |
| **Recode from lookup** | `age_map[x]` (named vector) | `recode_values(x, from = tbl$from, to = tbl$to)` |
| **Replace few values** | `if_else(x == "old", "new", x)` | `replace_values(x, "old" ~ "new")` |
| **Conditional replace** | `if_else(x > 5, 0, x)` | `replace_when(x, x > 5 ~ 0)` |
| **To snake_case** | `snakecase::to_snake_case(x)` | `str_to_snake(x)` |
| **To camelCase** | manual regex | `str_to_camel(x)` |
| **To kebab-case** | `gsub(" ", "-", tolower(x))` | `str_to_kebab(x)` |
| **Case-insensitive LIKE** | `str_detect(x, regex("pat", ignore_case = TRUE))` | `str_ilike(x, "%pat%")` |
::: {.callout-note}
## Deprecation Timeline
- **`recode()`** — superseded since dplyr 1.1.0; migrate to `recode_values()` or `replace_values()`
- **`case_match()`** — soft-deprecated in dplyr 1.2.0; migrate to `recode_values()`
- **`str_like(ignore_case = TRUE)`** — deprecated in stringr 1.6.0; use `str_ilike()` instead
These older functions continue to work. Superseded functions such as `recode()` run without warnings, while deprecated ones may emit deprecation warnings. New code should use the replacements above.
:::
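Before swapping `filter(x != ...)` for `filter_out(x == ...)`, it is worth checking how missing values are treated, since the two forms are not guaranteed to return identical rows when `x` contains `NA`. The sketch below uses only long-standing dplyr functions on hypothetical toy data (`toy_rows`) to show what the "before" idiom does; consult the dplyr 1.2.0 release notes for how `filter_out()` handles the same case.

```{r}
library(dplyr)

# Hypothetical toy data with one missing label
toy_rows <- tibble(
  x = c("Total", "Male", NA, "Female"),
  n = c(100, 60, 5, 40)
)

# The "before" idiom: NA != "Total" evaluates to NA, so filter() drops that row too
kept_before <- toy_rows |>
  filter(x != "Total")
kept_before

# Keeping the NA row with the older idiom needs an explicit check
kept_with_na <- toy_rows |>
  filter(x != "Total" | is.na(x))
kept_with_na
```

The first result has two rows and the second has three, so a migration is a good moment to decide which behavior the analysis actually needs.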
---
## Key Insights {#sec-insights}
::: {.callout-important icon="true"}
## What the Data Tells Us
### 1. **Population Concentration**
The top five states account for a significant portion of the national population. This geographic concentration should inform resource allocation decisions.
### 2. **Youth Demographics**
The age distribution reveals a **young population**—typical of developing nations. This "youth bulge" represents both:
- **Opportunity**: Large workforce potential
- **Challenge**: Need for education and employment infrastructure
### 3. **Gender Balance**
Most states show **relatively balanced** gender distributions, with some variation that may reflect:
- Migration patterns
- Conflict impacts
- Data collection methodology
### 4. **Regional Disparities**
Substantial population differences between states suggest the need for:
- Differentiated development strategies
- Targeted resource allocation
- Context-specific policy interventions
:::
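The first insight can be turned into a number with a short pipeline. The sketch below runs on placeholder totals (`toy_states`, which are not census figures) to show the calculation.

```{r}
library(dplyr)

# Hypothetical state totals (placeholders, not census figures)
toy_states <- tibble(
  state = paste("State", LETTERS[1:8]),
  total = c(900, 800, 700, 600, 500, 300, 150, 50)
)

top5_share <- toy_states |>
  # Each state's share of the national total, in percent
  mutate(share = total / sum(total) * 100) |>
  # Keep the five most populous states
  slice_max(total, n = 5) |>
  # Add up their shares
  summarise(top5_pct = sum(share)) |>
  pull(top5_pct)

top5_share
```

Applying the same steps to `state_gender`, which already has a `total` column, would give the actual share held by the top five states.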
---
## Conclusion {#sec-conclusion}
Congratulations! You've completed a comprehensive demographic analysis using R and the tidyverse.
::: {.callout-tip icon="true"}
## What You've Learned
**Data Skills:**
✅ Loading data from URLs with `read_csv()`
✅ Cleaning data with tidyverse functions
✅ **String splitting and parsing** with `str_split_i()`, `str_remove()`, and regex
✅ **Extracting information from structured text** (gender from "Population - Male (Number)")
✅ **Recategorizing data** with `case_when()` and `recode_values()` for age groups
✅ **Dropping rows** with `filter_out()` (dplyr 1.2.0)
✅ Grouping and summarizing with `group_by()` and `summarise()`
✅ Reshaping data with `pivot_wider()` and `pivot_longer()`
✅ Calculating percentages and ratios
**Visualization Skills:**
✅ Creating pie charts and bar charts
✅ Customizing colors and themes
✅ Adding informative labels and titles
✅ Using `coord_flip()` for horizontal layouts
✅ Understanding Grammar of Graphics principles
**Table Skills:**
✅ Building professional tables with gt
✅ Formatting numbers and percentages
✅ Adding colors and styling
✅ Creating informative footnotes
**String Processing Skills:**
✅ Multiple methods for text extraction (split, regex, remove, separate)
✅ Using `separate_wider_delim()` to split into multiple columns
✅ Using `str_split_i()` to extract specific pieces
✅ **Case conversion** with `str_to_camel()`, `str_to_snake()`, `str_to_kebab()` (stringr 1.6.0)
✅ Understanding when to use each method
✅ Regular expressions for pattern matching
**Workflow Skills:**
✅ Using the pipe operator `|>` for readable code
✅ Writing clear, commented code
✅ Creating reproducible analyses
✅ Structuring code in logical steps
:::
::: {.callout-note}
## Next Steps for Learning
**Beginner:**
1. Practice with different datasets
2. Try modifying the colors and themes
3. Experiment with different chart types
**Intermediate:**
4. Learn about `purrr` for functional programming
5. Go deeper into `stringr` and regular expressions for text manipulation
6. Study `lubridate` for date handling
**Advanced:**
7. Create interactive dashboards with Shiny
8. Build custom functions and packages
9. Contribute to open-source R projects
**Resources:**
- [R for Data Science](https://r4ds.hadley.nz/) - Free online book
- [RStudio Cheatsheets](https://posit.co/resources/cheatsheets/) - Quick references
- [TidyTuesday](https://github.com/rfordatascience/tidytuesday) - Weekly practice datasets
:::
---
```{=html}
<!-- Author Card: Alier Reng -->
<hr class="author-section-divider">
<div class="author-card">
<img src="/images/blog/alier-reng-founder.png"
alt="Alier Reng"
class="author-card-photo">
<div class="author-card-info">
<h3>Alier Reng</h3>
<div class="author-card-role">Founder, Lead Educator & Creative Director at PyStatR+</div>
<p class="author-card-bio">
Alier Reng is a Data Scientist, Educator, and Founder of PyStatR+, a platform advancing open and practical data science education. His work blends analytics, philosophy, and storytelling to make complex ideas human and empowering. Knowledge is freedom. Data is truth's language — ethics and transparency, its grammar.
</p>
<div class="author-card-social">
<a href="https://www.pystatrplus.org" title="PyStatR+" aria-label="PyStatR+ Website" class="social-website">
<svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-1 17.93c-3.95-.49-7-3.85-7-7.93 0-.62.08-1.21.21-1.79L9 15v1c0 1.1.9 2 2 2v1.93zm6.9-2.54c-.26-.81-1-1.39-1.9-1.39h-1v-3c0-.55-.45-1-1-1H8v-2h2c.55 0 1-.45 1-1V7h2c1.1 0 2-.9 2-2v-.41c2.93 1.19 5 4.06 5 7.41 0 2.08-.8 3.97-2.1 5.39z"/></svg>
<span>Website</span>
</a>
<a href="https://github.com/Alierwai" title="GitHub" aria-label="GitHub" class="social-github">
<svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 2A10 10 0 0 0 2 12c0 4.42 2.87 8.17 6.84 9.5.5.08.66-.23.66-.5v-1.69c-2.77.6-3.36-1.34-3.36-1.34-.46-1.16-1.11-1.47-1.11-1.47-.91-.62.07-.6.07-.6 1 .07 1.53 1.03 1.53 1.03.87 1.52 2.34 1.07 2.91.83.09-.65.35-1.09.63-1.34-2.22-.25-4.55-1.11-4.55-4.92 0-1.11.38-2 1.03-2.71-.1-.25-.45-1.29.1-2.64 0 0 .84-.27 2.75 1.02.79-.22 1.65-.33 2.5-.33.85 0 1.71.11 2.5.33 1.91-1.29 2.75-1.02 2.75-1.02.55 1.35.2 2.39.1 2.64.65.71 1.03 1.6 1.03 2.71 0 3.82-2.34 4.66-4.57 4.91.36.31.69.92.69 1.85V21c0 .27.16.59.67.5C19.14 20.16 22 16.42 22 12A10 10 0 0 0 12 2z"/></svg>
<span>GitHub</span>
</a>
<a href="https://www.linkedin.com/in/alierreng" title="LinkedIn" aria-label="LinkedIn" class="social-linkedin">
<svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M20.45 20.45h-3.56v-5.57c0-1.33-.02-3.04-1.85-3.04-1.85 0-2.14 1.45-2.14 2.94v5.67H9.34V9h3.41v1.56h.05c.48-.9 1.64-1.85 3.37-1.85 3.6 0 4.27 2.37 4.27 5.46v6.28zM5.34 7.43a2.06 2.06 0 1 1 0-4.12 2.06 2.06 0 0 1 0 4.12zM7.12 20.45H3.56V9h3.56v11.45z"/></svg>
<span>LinkedIn</span>
</a>
<a href="https://youtube.com/@PyStatRPlus" title="YouTube" aria-label="YouTube" class="social-youtube">
<svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M23.498 6.186a3.016 3.016 0 0 0-2.122-2.136C19.505 3.545 12 3.545 12 3.545s-7.505 0-9.377.505A3.017 3.017 0 0 0 .502 6.186C0 8.07 0 12 0 12s0 3.93.502 5.814a3.016 3.016 0 0 0 2.122 2.136c1.871.505 9.376.505 9.376.505s7.505 0 9.377-.505a3.015 3.015 0 0 0 2.122-2.136C24 15.93 24 12 24 12s0-3.93-.502-5.814zM9.545 15.568V8.432L15.818 12l-6.273 3.568z"/></svg>
<span>YouTube</span>
</a>
</div>
</div>
</div>
```
---
## Editor's Note
This tutorial reflects a deliberate editorial balance between accessibility and technical depth. While R offers many approaches to data manipulation, this guide emphasizes the **tidyverse philosophy**—particularly dplyr 1.2.0 for data transformation and stringr 1.6.0 for text processing—because these tools prioritize readability and consistency.
This edition highlights dplyr 1.2.0's `filter_out()` for clearer row-dropping semantics and `recode_values()` with lookup tables for maintainable value mappings. We also introduce stringr 1.6.0's case conversion trio—`str_to_camel()`, `str_to_snake()`, and `str_to_kebab()`—for seamless interoperability with Python, JavaScript, and API naming conventions.
This approach aligns with the PyStatR+ Charter by emphasizing clarity, honesty, and accessibility without unnecessary complexity.
---
## Acknowledgements
This lesson is part of the broader **PyStatR+ Learning Platform**, developed with gratitude to mentors, learners, and the open-source community that continually advances the R ecosystem. Special thanks to Hadley Wickham, the tidyverse team, and the contributors who make tools like dplyr, stringr, and ggplot2 possible.
---
## References
- [R for Data Science (2nd Edition)](https://r4ds.hadley.nz/) — Wickham, Çetinkaya-Rundel, & Grolemund
- [dplyr 1.2.0 Release Notes](https://tidyverse.org/blog/2026/02/dplyr-1-2-0/) — Tidyverse Blog
- [Recoding and Replacing Values](https://dplyr.tidyverse.org/articles/recoding-replacing.html) — dplyr vignette
- [stringr 1.6.0 Release Notes](https://tidyverse.org/blog/2025/11/stringr-1-6-0/) — Tidyverse Blog
- [dplyr Documentation](https://dplyr.tidyverse.org/)
- [stringr Documentation](https://stringr.tidyverse.org/)
- [ggplot2 Documentation](https://ggplot2.tidyverse.org/)
- [gt Package Documentation](https://gt.rstudio.com/)
- [South Sudan National Bureau of Statistics](http://southsudan.opendataforafrica.org/)
---
**PyStatR+** — *Learning Simplified. Communication Amplified.* 🚀