Demographic Patterns in South Sudan: A Tidyverse Exploration

A Beginner’s Guide to Data Analysis with R and the Tidyverse

Tidyverse

ggplot2

Demographics

Data Visualization

Learn data analysis step-by-step using R’s tidyverse! This beginner-friendly tutorial explores South Sudan’s demographics with clear explanations, beautiful visualizations, and professional tables.

Author

Alierwai Reng

Published

November 20, 2024

Introduction

Welcome to this hands-on data analysis tutorial! By the end of this guide, you’ll understand how to:

Load and explore real-world census data
Clean and transform data using tidyverse functions
Calculate summary statistics and group-level metrics
Create beautiful visualizations with ggplot2
Build professional tables with gt

We’ll analyze South Sudan’s 2008 census data, but the techniques you learn apply to any dataset.

What is the Tidyverse?

The tidyverse is a collection of R packages designed to work together seamlessly for data science. Think of it as a complete toolkit where all the tools fit perfectly in your hand. Key packages include:

dplyr - data manipulation (filter, group, summarize)
ggplot2 - data visualization (charts and graphs)
tidyr - data tidying (reshape and clean)
readr - data import (CSV, TSV, and other delimited text files; Excel files need the separate readxl package)

These packages share a common design philosophy, making your code readable and your workflow intuitive!

Part 1: Environment Setup

Step 1: Load Required Packages

Every R analysis starts by loading the libraries (packages) we need. Think of packages as apps on your phone—each one adds specific capabilities.

# Core tidyverse packages (loads dplyr, ggplot2, tidyr, readr, stringr, and more!)
library(tidyverse)
library(janitor)        # Data-cleaning helpers such as clean_names()

# Table formatting packages
library(gt)             # Grammar of Tables - for beautiful tables
library(gtExtras)       # Extra features for gt tables

# Visualization enhancement packages
library(ggtext)         # Rich text formatting in ggplot2
library(scales)         # Scale functions for axes and labels
library(glue)           # Easy string interpolation

# Confirmation message
cat("✅ All packages loaded successfully!\n")

✅ All packages loaded successfully!

cat("📦 Tidyverse version:", as.character(packageVersion("tidyverse")), "\n")

📦 Tidyverse version: 2.0.0

Package Installation

If you don’t have these packages installed, run this code once:

install.packages(c("tidyverse", "janitor", "gt", "gtExtras", "ggtext", "scales", "glue"))

After installation, you only need to load them with library() in each new R session.

Step 2: Configure Visualization Theme

Setting a consistent theme makes all your visualizations look professional and cohesive. This is like choosing a design template for all your charts.

# Set default theme for ALL ggplot2 visualizations
theme_set(
  theme_minimal(base_size = 13, base_family = "sans") +
    theme(
      # Title styling - bold and colored
      plot.title = element_markdown(
        size = 16, 
        face = "bold", 
        color = "#06b6d4",
        margin = margin(b = 10)  # Space below title
      ),
      
      # Subtitle styling - smaller and gray
      plot.subtitle = element_markdown(
        size = 12, 
        color = "#666666",
        margin = margin(b = 15)  # Space below subtitle
      ),
      
      # Caption styling - small and light gray
      plot.caption = element_markdown(
        size = 9, 
        color = "#999999", 
        hjust = 0  # Left-aligned
      ),
      
      # Grid lines
      panel.grid.minor = element_blank(),  # Remove minor grid lines
      panel.grid.major = element_line(color = "#e5e5e5", linewidth = 0.3),
      
      # Legend positioning and styling
      legend.position = "top",
      legend.title = element_text(face = "bold", size = 11),
      
      # Axis labels
      axis.title = element_text(face = "bold", size = 11)
    )
)

cat("🎨 Custom theme configured!\n")

🎨 Custom theme configured!

Why Use theme_set()?

Using theme_set() means you don’t have to add the same theme code to every plot. Set it once at the beginning, and all your plots will look consistent!

Think of it like setting your phone’s wallpaper—you set it once, and it applies everywhere.
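One detail worth knowing: theme_set() hands back the theme that was active before the call, so the default can be restored later. The sketch below assumes only ggplot2 and the built-in mtcars data; the name old_theme is illustrative and not part of the tutorial's code.

```r
library(ggplot2)

# theme_set() invisibly returns the theme that was active before the call
old_theme <- theme_set(theme_minimal(base_size = 13))

# Plots drawn while this default is active use it; no "+ theme_*()" is needed
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Restore the earlier default when you are done
theme_set(old_theme)
```

Saving the old theme this way is handy in scripts that should not leave a changed default behind.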

Part 2: Loading and Exploring Data

Step 3: Load Census Data from URL

Real datasets often live on the web. The readr package (part of tidyverse) can read directly from URLs, so you don’t have to download the file manually first!

# Define the data source URL
url <- "https://raw.githubusercontent.com/tongakuot/r_tutorials/main/06-data-wrangling/00-input/ss_2008_census_data_raw.csv"

# Read the CSV file into R
# read_csv() is from the readr package (part of tidyverse)
census_raw <- read_csv(url, show_col_types = FALSE)

# Display basic information about the dataset
cat("✅ Data loaded successfully!\n")

✅ Data loaded successfully!

cat("📋 Dimensions:", nrow(census_raw), "rows ×", ncol(census_raw), "columns\n")

📋 Dimensions: 453 rows × 10 columns

read_csv() vs read.csv()

R has two main functions for reading CSV files:

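read.csv() - Base R function (slower on large files, returns a data.frame, and rewrites column names such as “Region Name” to Region.Name; before R 4.0 it also converted text to factors by default)
read_csv() - Tidyverse function (faster, returns a tibble, keeps column names as written)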

We use read_csv() because it’s faster and works better with tidyverse workflows!
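To make the difference concrete, here is a small sketch that reads the same two-column text with both functions. The inline csv_text is made up for illustration (it reuses two column names from the census file), and I() for literal data assumes readr 2.0 or later.

```r
library(readr)

# Made-up CSV text that reuses two column names from the census file
csv_text <- "Region Name,2008\nUpper Nile,964353\n"

base_df <- read.csv(text = csv_text)                      # base R
tidy_df <- read_csv(I(csv_text), show_col_types = FALSE)  # readr

names(base_df)  # "Region.Name" "X2008"  (names rewritten to be syntactic)
names(tidy_df)  # "Region Name" "2008"   (names kept as written)
class(tidy_df)  # a tibble, which is also a data.frame
```

Because read_csv() keeps the original names, columns such as `2008` need backticks until they are renamed.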

Step 4: Examine the Data Structure

Before cleaning, we must understand our data. What columns exist? What types are they? Let’s investigate!

# glimpse() is like str() but more readable
# It shows: column names, data types, and first few values
glimpse(census_raw)

Rows: 453
Columns: 10
$ Region              <chr> "KN.A2", "KN.A2", "KN.A2", "KN.A2", "KN.A2", "KN.A…
$ `Region Name`       <chr> "Upper Nile", "Upper Nile", "Upper Nile", "Upper N…
$ `Region - RegionId` <chr> "SS-NU", "SS-NU", "SS-NU", "SS-NU", "SS-NU", "SS-N…
$ Variable            <chr> "KN.B2", "KN.B2", "KN.B2", "KN.B2", "KN.B2", "KN.B…
$ `Variable Name`     <chr> "Population, Total (Number)", "Population, Total (…
$ Age                 <chr> "KN.C1", "KN.C2", "KN.C3", "KN.C4", "KN.C5", "KN.C…
$ `Age Name`          <chr> "Total", "0 to 4", "5 to 9", "10 to 14", "15 to 19…
$ Scale               <chr> "units", "units", "units", "units", "units", "unit…
$ Units               <chr> "Persons", "Persons", "Persons", "Persons", "Perso…
$ `2008`              <dbl> 964353, 150872, 151467, 126140, 103804, 82588, 767…

Understanding glimpse() Output

The output shows:

Rows and Columns count at the top
Each line shows: column_name <data_type> first_few_values

Common data types:

<chr> = character (text)
<dbl> = double (decimal numbers)
<int> = integer (whole numbers)
<lgl> = logical (TRUE/FALSE)

Step 5: Preview the Data

Let’s look at the actual data to see what we’re working with:

# head() shows the first n rows (default = 6, we're asking for 10)
census_raw |> 
  head(10)

# A tibble: 10 × 10
   Region `Region Name` `Region - RegionId` Variable `Variable Name`       Age  
   <chr>  <chr>         <chr>               <chr>    <chr>                 <chr>
 1 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C1
 2 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C2
 3 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C3
 4 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C4
 5 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C5
 6 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C6
 7 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C7
 8 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C8
 9 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C9
10 KN.A2  Upper Nile    SS-NU               KN.B2    Population, Total (N… KN.C…
# ℹ 4 more variables: `Age Name` <chr>, Scale <chr>, Units <chr>, `2008` <dbl>

The Pipe Operator |>

The pipe operator |> means “take this and then do that”. It makes code read like a sentence:

# Without pipe (reads inside-out once calls are nested)
head(census_raw, 10)

# With pipe (reads left to right)
census_raw |> head(10)

Read it as: “Take census_raw, THEN show me the head (first rows)”

Note: You might also see %>% (the magrittr pipe) in older code. For simple chains like these, the two behave the same way. The native |> pipe requires R 4.1 or later.
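The benefit of the pipe shows up once several calls are chained. A minimal sketch using only base R functions (the vector x is made up for illustration):

```r
x <- c(4, 9, 16)

# Nested calls read inside-out: sqrt first, then mean, then round
nested <- round(mean(sqrt(x)), 1)

# The same steps with the native pipe read left to right
piped <- x |> sqrt() |> mean() |> round(1)

identical(nested, piped)  # TRUE
```

Both versions compute the same value; the piped one lists the steps in the order they happen.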

Part 3: Data Cleaning

Raw data is rarely analysis-ready. We need to clean and standardize it first!

Step 6 (a): Clean and Transform the Dataset

We’ll perform several cleaning operations in one pipeline using the pipe operator |>:

census_clean <- census_raw |>
  
  # Step 1: Standardize column names
  # Convert to lowercase and replace spaces with underscores
  rename_with(~ str_to_lower(.) |> str_replace_all(" ", "_")) |>
  
  # Step 2: Rename columns to meaningful names
  select(
    state = region_name,      # Rename region_name to state
    gender = variable_name,   # Rename variable_name to gender  
    age_category = age_name,  # Rename age_name to age_category
    population = `2008`       # Rename 2008 to population
  ) |>
  
  # Step 3: Clean character (text) columns
  mutate(
    # Remove extra whitespace and convert to title case
    state = str_squish(state) |> str_to_title(),
    gender = str_squish(gender) |> str_to_title(),
    age_category = str_squish(age_category)
  ) |>
  
  # Step 4: Ensure population is numeric (integer)
  mutate(population = as.integer(population)) |>
  
  # Step 5: Remove rows with missing or invalid data
  filter(!is.na(population), population > 0)

# Display results
cat("✅ Data cleaning complete!\n")

✅ Data cleaning complete!

cat("📊 Cleaned dataset:", nrow(census_clean), "rows ×", ncol(census_clean), "columns\n")

📊 Cleaned dataset: 450 rows × 4 columns

Step 6 (b): Alternative Cleaning with Janitor Package

Here’s a more efficient approach using the janitor package:

janitor_df <- 
  census_raw |>

  # Step 1: Standardize column names automatically
  # Convert to lowercase and replace spaces with underscores - janitor::clean_names()
   clean_names() |> 
  
  # Step 2: Rename columns to meaningful names
  select(
    state = region_name,      # Rename region_name to state
    gender = variable_name,   # Rename variable_name to gender  
    age_category = age_name,  # Rename age_name to age_category
    population = x2008        # Rename x2008 to population
  ) |>
  
  # Step 3: Clean character (text) columns
  mutate(
    across(where(is.character), \(x) str_to_title(x))
  ) |>
  
  # Step 4: Ensure population is numeric (integer)
  mutate(population = as.integer(population)) |>
  
  # Step 5: Remove rows with missing or invalid data
  filter(!is.na(population), population > 0)


# Display results
cat("✅ Data cleaning complete!\n")

✅ Data cleaning complete!

cat("📊 Cleaned dataset:", nrow(janitor_df), "rows ×", ncol(janitor_df), "columns\n")

📊 Cleaned dataset: 450 rows × 4 columns

Understanding Both Cleaning Pipelines

Let’s break down what each step does in both approaches:

Step 1: Standardizing Column Names

Method 6(a) - Manual with rename_with():

rename_with() applies a function to ALL column names
str_to_lower() converts to lowercase: “Region Name” → “region name”
str_replace_all() replaces spaces with underscores: “region name” → “region_name”
Names that start with a digit are left alone, so “2008” still needs backticks: `2008`
You control exactly what transformations happen

Method 6(b) - Automatic with clean_names():

clean_names() from the janitor package does everything at once
Converts to lowercase and replaces spaces with underscores: “Region Name” → “region_name”
Prefixes names that start with a digit: “2008” → “x2008”
Removes or replaces special characters
Makes names unique and R-friendly
One function, multiple benefits!
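The two naming approaches can be compared on a tiny table. This is a sketch, not the tutorial's pipeline: the one-row tibble raw is hypothetical and only reuses two of the raw column names.

```r
library(dplyr)
library(stringr)
library(janitor)

# Hypothetical one-row table that reuses two of the raw column names
raw <- tibble(`Region Name` = "Upper Nile", `2008` = 964353)

# Manual: lowercase, then replace spaces with underscores
manual <- raw |>
  rename_with(\(nm) str_to_lower(nm) |> str_replace_all(" ", "_"))

# Automatic: janitor handles case, spaces, and leading digits
automatic <- raw |> clean_names()

names(manual)     # "region_name" "2008"
names(automatic)  # "region_name" "x2008"
```

This is why the select() call refers to `2008` in Method 6(a) and to x2008 in Method 6(b).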

Step 2: Renaming Columns with `select()`

Both methods do this identically:

Chooses which columns to keep
Renames while selecting: new_name = old_name
Drops columns we don’t need
Note: In 6(b), we rename x2008 instead of 2008 (because clean_names() added the ‘x’)

Step 3: Cleaning Text Columns with `mutate()`

Method 6(a) - One Column at a Time:

mutate(
  state = str_squish(state) |> str_to_title(),
  gender = str_squish(gender) |> str_to_title(),
  age_category = str_squish(age_category)
)

Each column is cleaned individually
str_squish() removes leading, trailing, and repeated spaces: “  South   Sudan ” → “South Sudan”
str_to_title() capitalizes words: “JUBA” → “Juba”
More typing, but explicit control over each column

Method 6(b) - All at Once with across():

mutate(
  across(where(is.character), \(x) str_to_title(x))
)

across() applies a function to multiple columns at once
where(is.character) automatically selects only text columns
\(x) is a shorthand function (lambda) that takes each column (x)
Applies str_to_title() to capitalize: “new york” → “New York”
Less typing, works on any number of text columns!
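Note: This version skips str_squish(), so it assumes the text has no stray whitespace. It also title-cases age_category (“0 to 4” becomes “0 To 4”), which Method 6(a) leaves untouched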
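The side effect on age_category is easy to see on a small example. The sketch below uses a made-up one-row tibble; naming the columns inside across() is one possible way to leave age_category alone, not something the tutorial itself does.

```r
library(dplyr)
library(stringr)

# Made-up row with messy text in the same three columns
demo <- tibble(
  state = "  upper   nile ",
  gender = "population, male (number)",
  age_category = "0 to 4"
)

# where(is.character) touches every text column, including age_category
all_cols <- demo |>
  mutate(across(where(is.character), \(x) str_to_title(x)))

# Naming the columns keeps age_category as it was, and adds str_squish()
chosen_cols <- demo |>
  mutate(across(c(state, gender), \(x) str_to_title(str_squish(x))))

all_cols$age_category     # "0 To 4"
chosen_cols$age_category  # "0 to 4"
chosen_cols$state         # "Upper Nile"
```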

Step 4: Converting Population to Integer

Both methods use as.integer():

Converts to whole numbers only (no decimals)
Makes sense for population - you can’t have 3.5 people!
Uses less memory than as.numeric()
Truncates any fractional part rather than rounding, so check for unexpected decimals first

Why not as.numeric()?

as.numeric() allows decimals: 1234.5
Population should be whole numbers: 1234
as.integer() is the natural choice for count data
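Because as.integer() silently drops the fractional part, a quick check before converting can catch surprises. A small sketch with made-up values (the last one is deliberately fractional):

```r
# Made-up counts; the last value is deliberately fractional
pop <- c(964353, 150872, 1234.9)

converted <- as.integer(pop)
converted                      # 964353 150872 1234  (truncated, not rounded)

# Flag values that are not whole numbers before converting
is_whole <- pop == round(pop)
is_whole                       # TRUE TRUE FALSE

# Integers in R top out here, which is plenty for census counts
.Machine$integer.max           # 2147483647
```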

Step 5: Filtering Out Bad Data

Both methods use the same filter:

!is.na(population) removes rows where population is missing (NA)
population > 0 removes rows where population is zero or negative
Only keeps valid, positive population counts
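Before dropping rows, it can help to look at what the filter will remove. This is a sketch on a made-up four-row tibble, not the census data:

```r
library(dplyr)

# Made-up data with one missing and one zero population
demo <- tibble(
  state = c("A", "B", "C", "D"),
  population = c(120L, NA, 0L, 45L)
)

# Rows the cleaning filter would remove
dropped <- demo |> filter(is.na(population) | population <= 0)

# Rows it keeps (same conditions as in the pipelines above)
kept <- demo |> filter(!is.na(population), population > 0)

# filter() only keeps rows where the condition is TRUE, so NA rows
# are dropped by `population > 0` alone; the explicit check documents intent
kept_short <- demo |> filter(population > 0)
identical(kept, kept_short)  # TRUE
```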

When to Use Each Method

Use Method 6(a) when:

You need precise control over each transformation
You’re learning the fundamentals of data cleaning
Different columns need different cleaning steps
You want to see exactly what’s happening at each step

Use Method 6(b) when:

You have many similar columns to clean
You want cleaner, more concise code
Column names are messy and need standardization
You’re comfortable with more advanced R techniques

The Result: Both methods produce the same 450 rows and 4 columns. The one difference is that Method 6(b) also title-cases age_category (“0 To 4” instead of “0 to 4”). The rest of this tutorial continues with census_clean from Method 6(a).

Step 7: Preview the Cleaned Data

Let’s see how our cleaned data looks now:

# View first 10 rows of cleaned data
census_clean |> 
  head(10)

# A tibble: 10 × 4
   state      gender                     age_category population
   <chr>      <chr>                      <chr>             <int>
 1 Upper Nile Population, Total (Number) Total            964353
 2 Upper Nile Population, Total (Number) 0 to 4           150872
 3 Upper Nile Population, Total (Number) 5 to 9           151467
 4 Upper Nile Population, Total (Number) 10 to 14         126140
 5 Upper Nile Population, Total (Number) 15 to 19         103804
 6 Upper Nile Population, Total (Number) 20 to 24          82588
 7 Upper Nile Population, Total (Number) 25 to 29          76754
 8 Upper Nile Population, Total (Number) 30 to 34          63134
 9 Upper Nile Population, Total (Number) 35 to 39          56806
10 Upper Nile Population, Total (Number) 40 to 44          42139

Data Cleaning Checklist

When cleaning data, always:

✅ Standardize names - Use consistent naming (lowercase, underscores)
✅ Remove whitespace - Trim extra spaces that cause problems
✅ Fix data types - Numbers should be numeric, not text
✅ Handle missing values - Decide: remove, replace, or keep?
✅ Check for duplicates - Remove or investigate unusual patterns

Our pipeline handles the first four. It does not check for duplicates, so that step is still worth doing separately.
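The duplicate check from the list above can be sketched with count() and distinct(). The tibble below is made up and contains one exact duplicate row on purpose:

```r
library(dplyr)

# Made-up rows; the first two are exact duplicates
demo <- tibble(
  state = c("Upper Nile", "Upper Nile", "Jonglei"),
  gender = c("Male", "Male", "Male"),
  age_category = c("0 to 4", "0 to 4", "0 to 4"),
  population = c(100L, 100L, 80L)
)

# Which state/gender/age combinations appear more than once?
dupes <- demo |>
  count(state, gender, age_category) |>
  filter(n > 1)

# Drop exact duplicate rows
deduped <- demo |> distinct()

nrow(dupes)    # 1
nrow(deduped)  # 2
```

An empty dupes table is the reassuring outcome; anything else deserves a closer look before removing rows.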

Part 3B: Advanced String Processing

Now we need to extract more meaningful information from our data. The gender column actually contains structured text like “Population, Male (Number)” that we can parse!

Step 8: Examine the Gender Column Structure

Let’s look at what values exist in the gender column:

# See unique values in gender column
cat("🔍 Unique values in gender column:\n")

🔍 Unique values in gender column:

census_clean |> 
  distinct(gender) |> 
  pull(gender)

[1] "Population, Total (Number)"  "Population, Male (Number)"  
[3] "Population, Female (Number)"

String Structure Analysis

The gender column follows a pattern: “Population, Gender (Type)”

Examples:

“Population, Male (Number)”
“Population, Female (Number)”
“Population, Total (Number)”

We want to extract just the middle piece: Male, Female, or Total

This is called string splitting or text parsing!

Step 9: Extract Gender Information (Multiple Methods)

R provides several ways to extract information from text. Let’s explore different approaches!

Method 1: Using str_split() with List Extraction

The simplest approach: split on a delimiter and extract the piece you want.

# Split on " " and extract the second piece
method_1_example <- census_clean |>
  select(original_gender = gender) |>
  mutate(
    # Split creates a list, str_split_i gets the i-th piece
    gender_extracted = str_split_i(original_gender, " ", 2)
  ) |>
  distinct()

cat("✅ Method 1: Split on ' ' and extract 2nd piece\n")

✅ Method 1: Split on ' ' and extract 2nd piece

method_1_example

# A tibble: 3 × 2
  original_gender             gender_extracted
  <chr>                       <chr>           
1 Population, Total (Number)  Total           
2 Population, Male (Number)   Male            
3 Population, Female (Number) Female

How str_split_i() Works

str_split_i(string, pattern, i) breaks text at a delimiter:

string: The text to split
pattern: What to split on (here, a single space: " ")
i: Which piece to extract (1 = first, 2 = second, etc.)

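Example:

"Population, Male (Number)" 
  → split on " " 
  → ["Population,", "Male", "(Number)"]
  → extract 2nd piece 
  → "Male"

Because the label is split on every space, the second piece is already clean here. Had we split on ", " instead, the second piece would be “Male (Number)”, and we would still need to remove the “(Number)” part. Method 2 guards against that case.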
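How much cleanup is needed depends on the delimiter. The sketch below tries str_split_i() on one label with two different delimiters; it assumes stringr 1.5.0 or later, where str_split_i() was introduced.

```r
library(stringr)

label <- "Population, Male (Number)"

# Split on every space: the 2nd piece is already just the gender
by_space <- str_split_i(label, " ", 2)
by_space   # "Male"

# Split on ", " instead: the 2nd piece still carries the parenthetical
by_comma <- str_split_i(label, ", ", 2)
by_comma   # "Male (Number)"

# A negative index counts from the right-hand side
last_piece <- str_split_i(label, " ", -1)
last_piece # "(Number)"
```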

Method 2: Chain Multiple String Operations

Sometimes you need multiple steps to clean extracted text:

# Split on " ", then strip any trailing parenthetical as a safeguard
method_2_example <- census_clean |>
  select(original_gender = gender) |>
  mutate(
    # Step 1: Extract middle section
    temp = str_split_i(original_gender, " ", 2),
    # Step 2: Remove everything from " (" onwards
    gender_clean = str_remove(temp, " \\(.*\\)")
  ) |>
  select(-temp) |>
  distinct()

cat("✅ Method 2: Chained operations for clean extraction\n")

✅ Method 2: Chained operations for clean extraction

method_2_example

# A tibble: 3 × 2
  original_gender             gender_clean
  <chr>                       <chr>       
1 Population, Total (Number)  Total       
2 Population, Male (Number)   Male        
3 Population, Female (Number) Female

Understanding str_remove()

str_remove(string, pattern) deletes matching text:

Pattern: " \\(.*\\)" is a regular expression meaning:
- " \\(" = a literal space followed by a literal opening parenthesis
- .* = any characters (zero or more)
- \\) = literal closing parenthesis

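Result: Removes “ (Number)” from “Male (Number)” → “Male”. In our data the split piece is already “Male”, so nothing matches and the value passes through unchanged.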
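One thing to keep in mind with .* is that it is greedy: it matches as much as it can. The sketch below contrasts it with the lazy form .*? on a hypothetical label that has two parentheticals (no such label appears in the census data).

```r
library(stringr)

simple <- "Male (Number)"
tricky <- "Male (Number) extra (x)"  # hypothetical label, for illustration only

# One parenthetical: the pattern removes it
cleaned <- str_remove(simple, " \\(.*\\)")
cleaned  # "Male"

# Greedy .* runs to the LAST closing parenthesis
greedy <- str_remove(tricky, " \\(.*\\)")
greedy   # "Male"

# Lazy .*? stops at the FIRST closing parenthesis
lazy <- str_remove(tricky, " \\(.*?\\)")
lazy     # "Male extra (x)"
```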

Method 3: Using Regular Expressions (Regex)

For complex patterns, regex provides powerful extraction:

# Use regex to extract the text between the first space and " ("
method_3_example <- census_clean |>
  select(original_gender = gender) |>
  mutate(
    gender_regex = str_extract(original_gender, "(?<= ).*(?= \\()")
  ) |>
  distinct()

cat("✅ Method 3: Regex pattern matching\n")

✅ Method 3: Regex pattern matching

method_3_example

# A tibble: 3 × 2
  original_gender             gender_regex
  <chr>                       <chr>       
1 Population, Total (Number)  Total       
2 Population, Male (Number)   Male        
3 Population, Female (Number) Female

Understanding the Regex Pattern

"(?<= ).*(?= \\()" uses lookaround assertions:

Pattern breakdown:

(?<= ) = Lookbehind: must be preceded by a space (but don’t include it)
.* = Match: any characters (this is what we capture)
(?= \\() = Lookahead: must be followed by “ (” (but don’t include it)

In plain English: “Find the text that comes after the first space and before ‘ (’”

Example: “Population, Male (Number)”

After the first space: “Male (Number)”
Before “ (”: “Male”
Captured: “Male” ✅
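If lookarounds feel cryptic, a capture group is a common alternative. The sketch below shows both on one label; str_match() returns a matrix whose second column holds the first captured group.

```r
library(stringr)

label <- "Population, Male (Number)"

# Lookaround version, as used in Method 3
with_lookaround <- str_extract(label, "(?<= ).*(?= \\()")

# Capture-group version: match ", <something> (" and keep the group
with_group <- str_match(label, ", (.*) \\(")[, 2]

with_lookaround  # "Male"
with_group       # "Male"
```

Which of the two reads better is largely a matter of taste; both return NA when the pattern does not match.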

Method 4: Using str_remove() to Delete Unwanted Parts

Remove prefix and suffix to isolate what you need:

# Remove the prefix and suffix
method_4_example <- census_clean |>
  select(original_gender = gender) |>
  mutate(
    gender_replace = original_gender |>
      str_remove("Population, ") |>      # Remove prefix
      str_remove(" \\(Number\\)")          # Remove suffix
  ) |>
  distinct()

cat("✅ Method 4: String removal approach\n")

✅ Method 4: String removal approach

method_4_example

# A tibble: 3 × 2
  original_gender             gender_replace
  <chr>                       <chr>         
1 Population, Total (Number)  Total         
2 Population, Male (Number)   Male          
3 Population, Female (Number) Female

Method 5: Complete Solution in One Pipeline

Now let’s apply the cleanest method to our actual dataset:

# Apply gender extraction to our dataset
census_parsed <- census_clean |>
  mutate(
    # Extract gender: split on " ", take 2nd piece, remove parenthetical
    gender = str_split_i(gender, " ", 2) |>
             str_remove(" \\(.*\\)") |>
             str_squish()  # Remove any extra whitespace
  )

# Verify extraction worked
cat("✅ Gender extraction complete!\n")

✅ Gender extraction complete!

cat("🎯 Unique gender values:\n")

🎯 Unique gender values:

census_parsed |> 
  distinct(gender) |> 
  pull(gender)

[1] "Total"  "Male"   "Female"

Which Method Should You Use?

Choose based on your needs:

Method	Best For	Pros	Cons
Method 1	Simple, consistent patterns	Fast, readable	May need cleanup
Method 2	Multi-step cleaning	Very clear logic	More verbose
Method 3	Complex patterns	Most powerful	Requires regex knowledge
Method 4	Known prefix/suffix	Intuitive	Less flexible
Method 5	Production code	Clean, efficient	Combines multiple concepts

For this tutorial: Method 5 is ideal—it’s clean, efficient, and production-ready!

Method 6: Using separate_wider_delim() (Modern Tidyr Approach)

The separate_wider_delim() function from tidyr provides a clean, declarative way to split columns into multiple pieces. This is a modern approach that’s perfect for structured text!

# Demonstrate separate_wider_delim()
method_6_example <- census_clean |>
  select(original_gender = gender) |>
  
  # Step 1: Split into three parts on each space
  separate_wider_delim(
    cols = original_gender,
    delim = " ",
    names = c("prefix", "gender_raw", "suffix"),
    too_few = "align_start"  # Handle any rows with fewer delimiters
  ) |>
  
  # Step 2: Clean the extracted gender piece
  mutate(
    gender_clean = str_remove(gender_raw, " \\(.*\\)") |>
                   str_squish()
  ) |>
  
  # Show the transformation
  select(prefix, gender_raw, gender_clean) |>
  distinct()

cat("✅ Method 6: separate_wider_delim() approach\n")

✅ Method 6: separate_wider_delim() approach

method_6_example

# A tibble: 3 × 3
  prefix      gender_raw gender_clean
  <chr>       <chr>      <chr>       
1 Population, Total      Total       
2 Population, Male       Male        
3 Population, Female     Female

Understanding separate_wider_delim()

separate_wider_delim() is designed specifically for splitting one column into multiple columns:

Structure:

separate_wider_delim(
  cols = column_to_split,
  delim = "delimiter",
  names = c("col1", "col2", "col3"),
  too_few = "align_start"  # What to do if fewer pieces than expected
)

Key parameters:

cols: Which column to split
delim: What to split on (our case: a single space, " ")
names: Names for the new columns created
too_few: How to handle rows with fewer pieces than expected

What happens to our data:

Before:

original_gender: "Population, Male (Number)"

After:

prefix: "Population,"
gender_raw: "Male"
suffix: "(Number)"

Here gender_raw is already clean; the str_remove() and str_squish() calls are a safeguard for labels with extra text or whitespace.
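A short sketch of what too_few = "align_start" does when a row has fewer pieces than names. The second label below is made up to have only two pieces; separate_wider_delim() assumes tidyr 1.3.0 or later.

```r
library(tidyr)
library(tibble)

# Second label is made up: it has no "(Number)" part, so only two pieces
demo <- tibble(
  original_gender = c("Population, Male (Number)", "Population, Female")
)

parts <- demo |>
  separate_wider_delim(
    cols = original_gender,
    delim = " ",
    names = c("prefix", "gender_raw", "suffix"),
    too_few = "align_start"  # fill from the left, pad the rest with NA
  )

parts$gender_raw  # "Male" "Female"
parts$suffix      # "(Number)" NA
```

Without too_few, the short row would raise an error instead of being padded.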

separate_wider_delim() vs str_split_i()

Use separate_wider_delim() when:

You want to keep multiple pieces from the split
You need named columns for each piece
You want declarative, readable code
You’re doing data reshaping as part of tidying

Use str_split_i() when:

You only need one specific piece
You want fewer intermediate columns
You prefer a more compact solution
You’re doing quick transformations

Example comparison:

# separate_wider_delim: Keep all pieces
data |> separate_wider_delim(col, delim = " ", names = c("a", "b", "c"))
# Result: Three new columns (a, b, c)

# str_split_i: Extract one piece
data |> mutate(b = str_split_i(col, " ", 2))
# Result: One new column (b)

For our final dataset: We use str_split_i() (Method 5) because we only need the middle piece. But separate_wider_delim() is excellent when you need multiple pieces!

Comparison: All Six Methods Side by Side

Let’s see all approaches and their results:

# Create comparison table
comparison <- census_clean |>
  head(3) |>
  select(original = gender) |>
  mutate(
    method_1_split_i = str_split_i(original, " ", 2),
    method_2_chained = str_split_i(original, " ", 2) |>
                      str_remove(" \\(.*\\)"),
    method_3_regex = str_extract(original, "(?<= ).*(?= \\()"),
    method_4_remove = original |>
                     str_remove("Population, ") |>
                     str_remove(" \\(Number\\)"),
    method_5_final = str_split_i(original, " ", 2) |>
                    str_remove(" \\(.*\\)") |>
                    str_squish()
  )

# Add method 6 separately (separate_wider_delim works differently)
comparison_method_6 <- census_clean |>
  head(3) |>
  select(original = gender) |>
  separate_wider_delim(
    cols = original,
    delim = " ",
    names = c("prefix", "gender_raw", "suffix"),
    too_few = "align_start"
  ) |>
  mutate(
    method_6_separate = str_remove(gender_raw, " \\(.*\\)") |> str_squish()
  ) |>
  pull(method_6_separate)

# Combine and display
comparison |>
  mutate(method_6_separate = comparison_method_6) |>
  gt() |>
  tab_header(
    title = md("**Comparison of All Six Methods**"),
    subtitle = "Different approaches to extract the gender label from text like 'Population, Male (Number)'"
  ) |>
  tab_style(
    style = cell_text(size = px(10)),
    locations = cells_body()
  ) |>
  tab_style(
    style = cell_fill(color = "#f8f9fa"),
    locations = cells_body(columns = c(method_5_final, method_6_separate))
  ) |>
  tab_footnote(
    footnote = "Methods 5 and 6 (highlighted) produce clean output ready for analysis",
    locations = cells_column_labels(columns = c(method_5_final, method_6_separate))
  ) |>
  cols_label(
    original = "Original",
    method_1_split_i = "M1: split_i",
    method_2_chained = "M2: chained",
    method_3_regex = "M3: regex",
    method_4_remove = "M4: remove",
    method_5_final = "M5: final",
    method_6_separate = "M6: separate"
  )

Comparison of All Six Methods
Different approaches to extract the gender label from text like 'Population, Male (Number)'
Original	M1: split_i	M2: chained	M3: regex	M4: remove	M5: final¹	M6: separate¹
Population, Total (Number)	Total	Total	Total	Total	Total	Total
Population, Total (Number)	Total	Total	Total	Total	Total	Total
Population, Total (Number)	Total	Total	Total	Total	Total	Total
¹ Methods 5 and 6 (highlighted) produce clean output ready for analysis

Choosing Your Method: Decision Tree

Start here: What do you need?

1. Need multiple pieces from the split?
- Yes → Use Method 6: separate_wider_delim()
- No → Continue to #2
2. Is the pattern very complex (multiple conditions)?
- Yes → Use Method 3: Regex str_extract()
- No → Continue to #3
3. Do you know exact prefix/suffix to remove?
- Yes → Use Method 4: str_remove()
- No → Continue to #4
4. Need just one piece from a split?
- Yes → Use Method 5: str_split_i() with cleanup ✅ (Recommended for our case)
5. Want to see intermediate steps for debugging?
- Yes → Use Method 2: Chained operations
- No → Use Method 5 (most efficient)

For learning: Try all methods!
For production: Use Method 5 or 6 (cleanest, most maintainable)

Step 10: Recategorize Age Groups

Now let’s standardize the age categories into broader, more meaningful groups.

First, let’s see what age categories we currently have:

cat("🔍 Current age categories:\n")

🔍 Current age categories:

census_parsed |>
  dplyr::distinct(age_category) |>
  dplyr::arrange(age_category) |>
  dplyr::pull(age_category)

 [1] "0 to 4"   "10 to 14" "15 to 19" "20 to 24" "25 to 29" "30 to 34"
 [7] "35 to 39" "40 to 44" "45 to 49" "5 to 9"   "50 to 54" "55 to 59"
[13] "60 to 64" "65+"      "Total"

Why Recategorize Age Groups?

Original data: Fine-grained 5-year age bands (0-4, 5-9, etc.)

Problem:

Too many categories for high-level analysis
Harder to spot trends
Difficult to compare with other datasets

Solution: Group into broader categories:

0-14: Children
15-24: Youth/Young adults
25-34: Early working age
35-44: Middle working age
45-54: Later working age
55-64: Pre-retirement
65+: Retirement age

This is called binning or categorization!

Method 1: Using case_when() for Conditional Recategorization

The case_when() function is perfect for complex, multi-condition transformations:

census_final <- census_parsed |>
  mutate(
    age_category = case_when(
      # Children (0-14)
      age_category %in% c("0 to 4", "5 to 9", "10 to 14") ~ "0-14",
      
      # Youth (15-24)
      age_category %in% c("15 to 19", "20 to 24") ~ "15-24",
      
      # Early working age (25-34)
      age_category %in% c("25 to 29", "30 to 34") ~ "25-34",
      
      # Middle working age (35-44)
      age_category %in% c("35 to 39", "40 to 44") ~ "35-44",
      
      # Later working age (45-54)
      age_category %in% c("45 to 49", "50 to 54") ~ "45-54",
      
      # Pre-retirement (55-64)
      age_category %in% c("55 to 59", "60 to 64") ~ "55-64",
      
      # Retirement age (65+)
      age_category == "65+" ~ "65+",
      
      # Catch any unexpected values
      TRUE ~ age_category
    )
  )

# Verify recategorization
cat("✅ Age categories recategorized!\n")

✅ Age categories recategorized!

cat("🎯 New age categories:\n")

🎯 New age categories:

census_final |> 
  distinct(age_category) |> 
  arrange(age_category) |>
  pull(age_category)

[1] "0-14"  "15-24" "25-34" "35-44" "45-54" "55-64" "65+"   "Total"

Understanding case_when()

case_when() is like a multi-way IF statement (similar to SQL’s CASE WHEN):

Structure:

case_when(
  condition1 ~ result1,  # If condition1 is TRUE, return result1
  condition2 ~ result2,  # Else if condition2 is TRUE, return result2
  condition3 ~ result3,  # Else if condition3 is TRUE, return result3
  TRUE ~ default         # Else return default (catch-all)
)

Key points:

Conditions are evaluated in order (first match wins)
%in% checks if value is in a vector (like “is one of”)
~ separates condition from result
TRUE ~ at the end catches anything not matched above

Example for our data:

If age is “0 to 4” OR “5 to 9” OR “10 to 14” → return “0-14”
Else if age is “15 to 19” OR “20 to 24” → return “15-24”
And so on…
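Recoding alone leaves several rows per broad age group, because each original 5-year band keeps its own row. If one row per group is wanted, the recode can be followed by a sum. The sketch below uses made-up numbers and assumes dplyr 1.1.0 or later for the .default and .by arguments; it is an optional extra step, not part of the tutorial's pipeline.

```r
library(dplyr)

# Made-up populations for a few age bands
demo <- tibble(
  age_category = c("0 to 4", "5 to 9", "15 to 19", "Total"),
  population = c(10L, 20L, 30L, 60L)
)

binned <- demo |>
  mutate(
    age_category = case_when(
      age_category %in% c("0 to 4", "5 to 9", "10 to 14") ~ "0-14",
      age_category %in% c("15 to 19", "20 to 24") ~ "15-24",
      .default = age_category  # newer spelling of the TRUE ~ catch-all
    )
  ) |>
  # Collapse the recoded bands to one row per broad group
  summarise(population = sum(population), .by = age_category)

binned
```

In this sketch the two child bands collapse into a single “0-14” row with a population of 30.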

Method 2: Using Named Vector with Lookup (Alternative)

For simple 1-to-1 mappings, a named vector can work:

# Create mapping vector
age_mapping <- c(
  "0 to 4" = "0-14", "5 to 9" = "0-14", 
  "10 to 14" = "0-14", "15 to 19" = "15-24", 
  "20 to 24" = "15-24", "25 to 29" = "25-34", 
  "30 to 34" = "25-34", "35 to 39" = "35-44", 
  "40 to 44" = "35-44", "45 to 49" = "45-54", 
  "50 to 54" = "45-54", "55 to 59" = "55-64", 
  "60 to 64" = "55-64", "65+" = "65+"
)

# Demonstrate lookup (not applied to dataset); values missing from the mapping, such as "Total", become NA
demo_recode <- census_parsed |>
  select(age_category) |>
  mutate(
    age_category_alt = age_mapping[age_category]
  ) |>
  distinct()

cat("✅ Method 2 demonstration:\n")

✅ Method 2 demonstration:

demo_recode

# A tibble: 15 × 2
   age_category age_category_alt
   <chr>        <chr>           
 1 Total        <NA>            
 2 0 to 4       0-14            
 3 5 to 9       0-14            
 4 10 to 14     0-14            
 5 15 to 19     15-24           
 6 20 to 24     15-24           
 7 25 to 29     25-34           
 8 30 to 34     25-34           
 9 35 to 39     35-44           
10 40 to 44     35-44           
11 45 to 49     45-54           
12 50 to 54     45-54           
13 55 to 59     55-64           
14 60 to 64     55-64           
15 65+          65+

Comparing case_when() vs Named Vector

Use case_when() when:
- Multiple conditions per category
- Complex logic (AND/OR operations)
- Need to explain your logic clearly
- Best for our use case ✅

Use named vector when:
- Simple 1-to-1 replacements
- Large number of mappings
- Mapping stored separately from code

Both work, but case_when() is more readable and maintainable for conditional logic!
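One practical difference is how unmatched values behave. case_when() sends them to the catch-all, while a named-vector lookup returns NA (as the "Total" row in the demonstration output shows). The sketch below uses a hypothetical three-entry `toy_mapping` instead of the full `age_mapping`, and shows one possible way to fall back to the original value with coalesce().

```r
library(dplyr)

# Hypothetical, shortened mapping for illustration
toy_mapping <- c("0 to 4" = "0-14", "5 to 9" = "0-14", "65+" = "65+")
toy_values  <- c("0 to 4", "65+", "Total")

# Look up by name; unname() drops the names that the lookup carries along
looked_up <- unname(toy_mapping[toy_values])

# Replace unmatched entries (NA) with the original value
with_fallback <- coalesce(looked_up, toy_values)

looked_up
with_fallback
```

Whether a fallback is appropriate depends on the analysis. Leaving the NA in place is often the safer choice, because it makes unmapped categories easy to spot.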

Step 11: Verify Final Cleaned Dataset

Step 11 (a): Removing Total Rows Using `filter()`

# Filter out rows where gender or age_category contain "Total"
# This removes aggregate/summary rows, keeping only individual demographic categories
census_filtered <- census_final |> 
  filter(gender != "Total", age_category != "Total")

# Display the filtered dataset
census_filtered

# A tibble: 280 × 4
   state      gender age_category population
   <chr>      <chr>  <chr>             <int>
 1 Upper Nile Male   0-14              82690
 2 Upper Nile Male   0-14              83744
 3 Upper Nile Male   0-14              71027
 4 Upper Nile Male   15-24             57387
 5 Upper Nile Male   15-24             42521
 6 Upper Nile Male   25-34             38795
 7 Upper Nile Male   25-34             32236
 8 Upper Nile Male   35-44             30228
 9 Upper Nile Male   35-44             22290
10 Upper Nile Male   45-54             18163
# ℹ 270 more rows

Step 11 (b): Alternative Filtering Approach with OR Logic

# Alternative method: Remove "Total" values using explicit OR logic
# Useful when you want to clearly see the filtering conditions
# Note: This approach is not used in subsequent analyses - shown for demonstration only
filtered_df <- census_final |> 
  filter(!((gender == "Total") | (age_category == "Total")))

# Display the filtered dataset
filtered_df

# A tibble: 280 × 4
   state      gender age_category population
   <chr>      <chr>  <chr>             <int>
 1 Upper Nile Male   0-14              82690
 2 Upper Nile Male   0-14              83744
 3 Upper Nile Male   0-14              71027
 4 Upper Nile Male   15-24             57387
 5 Upper Nile Male   15-24             42521
 6 Upper Nile Male   25-34             38795
 7 Upper Nile Male   25-34             32236
 8 Upper Nile Male   35-44             30228
 9 Upper Nile Male   35-44             22290
10 Upper Nile Male   45-54             18163
# ℹ 270 more rows
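The two filters in Steps 11 (a) and 11 (b) are equivalent by De Morgan's law: "not (A or B)" means the same as "not A and not B". A quick sketch on a made-up tibble (the name `toy_census` and its values are hypothetical) confirms that both forms keep the same rows.

```r
library(dplyr)

toy_census <- tibble(
  gender       = c("Male", "Female", "Total", "Male", "Total"),
  age_category = c("0-14", "Total", "0-14", "65+", "Total"),
  population   = c(10L, 20L, 30L, 40L, 50L)
)

# Approach (a): comma-separated conditions are combined with AND
kept_and <- toy_census |>
  filter(gender != "Total", age_category != "Total")

# Approach (b): negate an explicit OR
kept_or <- toy_census |>
  filter(!(gender == "Total" | age_category == "Total"))

identical(kept_and, kept_or)
```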

Let’s confirm our data cleaning and transformations worked correctly:

# Generate a comprehensive summary of the cleaned and transformed dataset
# This verification step ensures all transformations were applied correctly

cat(strrep("=", 50), "\n", sep = "")

==================================================

cat("🎉 DATA TRANSFORMATION COMPLETE!\n")

🎉 DATA TRANSFORMATION COMPLETE!

cat(strrep("=", 50), "\n", sep = "")

==================================================

# Display dimensions of the final cleaned dataset
cat("📊 Final dataset dimensions:\n")

📊 Final dataset dimensions:

cat("  Rows:", nrow(census_filtered), "\n")

  Rows: 280

cat("  Columns:", ncol(census_filtered), "\n\n")

  Columns: 4

# List all column names in the cleaned dataset
cat("✅ Column names:\n")

✅ Column names:

cat("  ", paste(names(census_filtered), collapse = ", "), "\n\n")

   state, gender, age_category, population

# Show all unique gender categories (after filtering, only "Male" and "Female" should remain)
cat("🎯 Unique gender values:\n")

🎯 Unique gender values:

census_filtered |> distinct(gender) |> pull(gender) |> cat("  ", "\n")

Male Female

# Show all unique age categories in sorted order
cat("\n🎯 Unique age categories:\n")


🎯 Unique age categories:

census_filtered |> distinct(age_category) |> arrange(age_category) |> pull(age_category) |> cat("  ", "\n")

0-14 15-24 25-34 35-44 45-54 55-64 65+

# Display first 10 rows of key columns to visually inspect the data
cat("\n📋 Sample of final data:\n")


📋 Sample of final data:

census_filtered |> 
  select(state, gender, age_category, population) |>
  head(10)

# A tibble: 10 × 4
   state      gender age_category population
   <chr>      <chr>  <chr>             <int>
 1 Upper Nile Male   0-14              82690
 2 Upper Nile Male   0-14              83744
 3 Upper Nile Male   0-14              71027
 4 Upper Nile Male   15-24             57387
 5 Upper Nile Male   15-24             42521
 6 Upper Nile Male   25-34             38795
 7 Upper Nile Male   25-34             32236
 8 Upper Nile Male   35-44             30228
 9 Upper Nile Male   35-44             22290
10 Upper Nile Male   45-54             18163

Data Transformation Summary

What we accomplished:

Original gender column values:
- “Population - Male (Number)”
- “Population - Female (Number)”
- “Population - Total (Number)”

Transformed to clean format:
- “Male”
- “Female”
- “Total”

Original age_category column structure:
- 14 individual age bands (thirteen 5-year bands plus “65+”), in addition to a “Total” row
- Examples: “0 To 4”, “5 To 9”, “10 To 14”, …, “65+”

Transformed to standardized life-stage categories:
- 7 broader, more interpretable age groups
- “0-14”, “15-24”, “25-34”, “35-44”, “45-54”, “55-64”, “65+”

Result: A clean, standardized dataset ready for analysis and visualization! ✨

Data Cleaning Best Practices Checklist

Follow these essential steps when cleaning any dataset:

✅ Standardize column names - Use consistent formatting (lowercase, underscores, no spaces)
✅ Remove unnecessary whitespace - Trim leading/trailing spaces that cause matching errors
✅ Ensure correct data types - Verify numeric data is stored as numbers, not text
✅ Address missing values - Decide upfront: remove rows, replace with values, or keep as-is?
✅ Identify and handle duplicates - Remove exact duplicates or investigate patterns
✅ Remove aggregate rows - Filter out summary/total rows that skew analysis

Our data pipeline addresses all of these considerations!
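As a compact illustration, several checklist items can be chained in one pipeline. This is a sketch on invented data (the `messy` tibble and its values are hypothetical), not the exact code used earlier in the tutorial, and it leaves out missing-value handling.

```r
library(dplyr)
library(stringr)
library(janitor)

# Hypothetical messy input, invented for illustration
messy <- tibble(
  `State Name` = c(" Jonglei", "Jonglei ", "Unity", "Total"),
  `Population` = c("100", "100", "250", "350")
)

tidy <- messy |>
  clean_names() |>                          # standardize names: state_name, population
  mutate(
    state_name = str_squish(state_name),    # trim stray whitespace
    population = as.integer(population)     # store numbers as numbers, not text
  ) |>
  distinct() |>                             # drop exact duplicates
  filter(state_name != "Total")             # drop aggregate rows

tidy
```

Note the order: whitespace is trimmed before distinct(), otherwise " Jonglei" and "Jonglei " would not be recognized as duplicates.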

Part 4: Data Exploration and Summary

Step 12: Create Overview Statistics

Let’s calculate some key statistics about our dataset:

# Create a summary table
overview_table <- census_filtered |>
  summarise(
    `Total Population` = comma(sum(population)),     # Format with commas
    `Number of States` = n_distinct(state),          # Count unique states
    `Age Categories` = n_distinct(age_category),     # Count unique ages
    `Gender Groups` = n_distinct(gender),            # Count unique genders
    `Total Observations` = comma(n())                # Count all rows
  )

# Display the summary
overview_table

# A tibble: 1 × 5
  `Total Population` `Number of States` `Age Categories` `Gender Groups`
  <chr>                           <int>            <int>           <int>
1 8,260,490                          10                7               2
# ℹ 1 more variable: `Total Observations` <chr>

Understanding summarise()

summarise() collapses data into summary statistics:

sum() - adds up values
n_distinct() - counts unique values
n() - counts total rows
comma() - formats numbers with commas (from scales package)

It reduces many rows into one row of summaries!
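Here is the same idea on a four-row toy tibble (the data and column names are made up for illustration), small enough to check by hand.

```r
library(dplyr)
library(scales)

# Hypothetical toy data
toy <- tibble(
  state      = c("A", "A", "B", "B"),
  population = c(1200L, 800L, 1500L, 500L)
)

toy_overview <- toy |>
  summarise(
    total_population = comma(sum(population)),  # formatted text, e.g. "4,000"
    n_states         = n_distinct(state),       # number of unique states
    n_rows           = n()                      # number of rows
  )

toy_overview
```

One thing to keep in mind: comma() returns text, so a column formatted this way is meant for display and can no longer be used in arithmetic.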

Step 13: Display as Professional Table

Now let’s make this summary look professional using the gt package:

overview_table |>
  gt() |>
  tab_header(
    title = md("**South Sudan 2008 Census Overview**"),
    subtitle = "Key Summary Statistics"
  ) |>
  tab_style(
    style = cell_fill(color = "#22d3ee"),
    locations = cells_body()
  ) |>
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_body()
  ) |>
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16),
    heading.subtitle.font.size = px(12)
  )

South Sudan 2008 Census Overview
Key Summary Statistics
Total Population	Number of States	Age Categories	Gender Groups	Total Observations
8,260,490	10	7	2	280

The Grammar of Tables (gt)

The gt package uses a layered approach (like ggplot2 for tables):

Start with data → gt()
Add headers → tab_header()
Style cells → tab_style()
Format numbers → fmt_number()
Adjust options → tab_options()

Each layer adds or modifies the table appearance!
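The five layers listed above can be sketched on a tiny made-up data frame. The object name `toy_table` and its values are hypothetical. Each numbered comment corresponds to one layer.

```r
library(gt)

toy_table <- data.frame(
  group = c("A", "B"),
  count = c(12345, 6789)
) |>
  gt() |>                                        # 1. start with data
  tab_header(title = "Toy Table") |>             # 2. add a header
  tab_style(                                     # 3. style cells
    style = cell_text(weight = "bold"),
    locations = cells_body(columns = group)
  ) |>
  fmt_number(columns = count, decimals = 0) |>   # 4. format numbers
  tab_options(table.font.size = px(13))          # 5. adjust options

toy_table
```

Printing `toy_table` in RStudio or a Quarto document renders it as an HTML table.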

Part 5: Gender Analysis

Step 14: Calculate National Gender Distribution

Let’s analyze how the population is distributed by gender:

gender_summary <- census_filtered |>
  
  # Step 1: Group data by gender
  group_by(gender) |>
  
  # Step 2: Calculate total population for each gender
  summarise(
    population = sum(population),
    .groups = "drop"  # Remove grouping after summarise
  ) |>
  
  # Step 3: Calculate percentages
  mutate(
    percentage = population / sum(population) * 100,
    percentage_label = percent(percentage / 100, accuracy = 0.01)
  ) |>
  
  # Step 4: Sort by population (largest first)
  arrange(desc(population))

# Display the results
gender_summary

# A tibble: 2 × 4
  gender population percentage percentage_label
  <chr>       <int>      <dbl> <chr>           
1 Male      4287300       51.9 51.90%          
2 Female    3973190       48.1 48.10%

Understanding group_by() and summarise()

These two functions work together like a team:

group_by(gender)
- Splits data into groups (one for Male, one for Female)
- Like separating cards into piles

summarise(population = sum(population))
- Performs calculations within each group
- sum() adds up all population values in each group
- Like counting cards in each pile

.groups = "drop"
- Removes the grouping after we’re done
- Prevents unexpected behavior in future operations

Final result: One row per gender with total population!
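The effect of .groups is easiest to see with two grouping variables. In the sketch below (toy data, hypothetical names), "drop_last" leaves the result grouped by state, while "drop" returns a plain ungrouped tibble.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  state      = c("A", "A", "B", "B"),
  gender     = c("Male", "Female", "Male", "Female"),
  population = c(10L, 20L, 30L, 40L)
)

# "drop_last" removes only the last grouping variable (gender)
still_grouped <- toy |>
  group_by(state, gender) |>
  summarise(population = sum(population), .groups = "drop_last")

# "drop" removes all grouping
ungrouped <- toy |>
  group_by(state, gender) |>
  summarise(population = sum(population), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

With a leftover grouping, a later mutate() that uses sum(population) would compute totals per state instead of across the whole table, which is the kind of surprise that "drop" avoids.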

Step 15: Display Gender Table

gender_summary |>
  # Rename columns for display
  select(
    Gender = gender, 
    Population = population, 
    `Percentage` = percentage_label
  ) |>
  
  # Create gt table
  gt() |>
  
  # Add title and subtitle
  tab_header(
    title = md("**National Gender Distribution**"),
    subtitle = "South Sudan 2008 Census"
  ) |>
  
  # Format population with commas
  fmt_number(
    columns = Population,
    decimals = 0,
    use_seps = TRUE
  ) |>
  
  # Style Male row (first row) - cyan background
  tab_style(
    style = list(
      cell_fill(color = "#22d3ee"),
      cell_text(color = "white", weight = "bold")
    ),
    locations = cells_body(rows = 1)
  ) |>
  
  # Style Female row (second row) - gold background
  tab_style(
    style = list(
      cell_fill(color = "#FFD700"),
      cell_text(color = "#000000", weight = "bold")
    ),
    locations = cells_body(rows = 2)
  ) |>
  
  # Center all columns
  cols_align(align = "center", columns = everything()) |>
  
  # Adjust font sizes
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16)
  )

Gender	Population	Percentage
National Gender Distribution
South Sudan 2008 Census
Male	4,287,300	51.90%
Female	3,973,190	48.10%

Step 16: Visualize Gender Distribution

Numbers are great, but visualizations make patterns instantly clear. Let’s create a pie chart:

ggplot(gender_summary, aes(x = "", y = population, fill = gender)) +
  
  # Create a bar chart (we'll turn it into a pie)
  geom_col(width = 1, color = "white", linewidth = 2) +
  
  # Convert bar chart to pie chart using polar coordinates
  coord_polar(theta = "y") +
  
  # Set custom colors for Male and Female
  scale_fill_manual(
    values = c(
      "Male" = "#22d3ee",
      "Female" = "#FFD700"
    )
  ) +
  
  # Add labels showing counts and percentages
  geom_text(
    aes(label = glue("{comma(population)}\n({percentage_label})")),
    position = position_stack(vjust = 0.5),  # Center in each slice
    size = 5,
    fontface = "bold",
    color = "white"
  ) +
  
  # Add titles and labels
  labs(
    title = "**Gender Distribution in South Sudan**",
    subtitle = "2008 Census Data",
    fill = "Gender"
  ) +
  
  # Use void theme for pie charts (removes axes)
  theme_void() +
  
  # Customize title and legend
  theme(
    plot.title = element_markdown(
      size = 16, 
      face = "bold", 
      color = "#06b6d4",
      hjust = 0.5,        # Center title
      margin = margin(b = 5)
    ),
    plot.subtitle = element_markdown(
      size = 12, 
      color = "#666666", 
      hjust = 0.5         # Center subtitle
    ),
    legend.position = "bottom",
    legend.title = element_text(face = "bold", size = 11)
  )

Gender distribution shown as a pie chart with population counts and percentages

Anatomy of a ggplot2 Chart

Every ggplot2 visualization follows this pattern:

Start with data → ggplot(data, aes(...))
Add geometry → geom_*() (point, line, bar, etc.)
Adjust scales → scale_*() (colors, axes, etc.)
Add labels → labs() (title, axes, etc.)
Apply theme → theme_*() (appearance)

Think of it like building with LEGO blocks—each layer adds something!

Bonus: coord_polar() transforms rectangular plots into circular ones (bar chart → pie chart)!
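A stripped-down sketch of the same pattern, using a made-up three-row data frame (names, values, and the grey color are hypothetical). The second plot shows that adding coord_polar() to the same kind of bar chart is enough to turn it into a pie.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(group = c("A", "B", "C"), value = c(3, 5, 2))

# The five building blocks, one per line
p_bar <- ggplot(toy, aes(x = group, y = value, fill = group)) +  # 1. data + aesthetics
  geom_col() +                                                   # 2. geometry
  scale_fill_manual(
    values = c(A = "#22d3ee", B = "#FFD700", C = "#666666")
  ) +                                                            # 3. scales
  labs(title = "Toy Chart", x = NULL, y = "Value") +             # 4. labels
  theme_minimal()                                                # 5. theme

# Same data as a single stacked bar, wrapped into a circle
p_pie <- ggplot(toy, aes(x = "", y = value, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

p_bar
p_pie
```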

Step 17: Gender Distribution by State

Now let’s see how gender distribution varies across different states:

state_gender <- census_filtered |>
  
  # Group by both state AND gender
  group_by(state, gender) |>
  
  # Sum population within each state-gender combination
  summarise(population = sum(population), .groups = "drop") |>
  
  # Reshape from long to wide format
  # Before: Multiple rows per state (one for Male, one for Female)
  # After: One row per state (Male and Female as separate columns)
  pivot_wider(names_from = gender, values_from = population) |>
  
  # Calculate additional metrics
  mutate(
    total = Male + Female,                    # Total population
    male_pct = Male / total * 100,           # Male percentage
    female_pct = Female / total * 100,       # Female percentage
    gender_ratio = Male / Female * 100       # Males per 100 females
  ) |>
  
  # Sort by total population (largest first)
  arrange(desc(total))

# Display top 5 states
state_gender |> 
  head(5)

# A tibble: 5 × 7
  state             Female   Male   total male_pct female_pct gender_ratio
  <chr>              <int>  <int>   <int>    <dbl>      <dbl>        <dbl>
1 Jonglei           624275 734327 1358602     54.1       45.9        118. 
2 Central Equatoria 521835 581722 1103557     52.7       47.3        111. 
3 Warrap            502194 470734  972928     48.4       51.6         93.7
4 Upper Nile        438923 525430  964353     54.5       45.5        120. 
5 Eastern Equatoria 440974 465187  906161     51.3       48.7        105.

Understanding pivot_wider()

pivot_wider() reshapes data from long to wide format:

Before (long format, illustrative values):

State    Gender  Population
Lakes    Male    50000
Lakes    Female  48000
Unity    Male    30000
Unity    Female  29000

After (wide format, with Total added afterwards by mutate()):

State    Male   Female  Total
Lakes    50000  48000   98000
Unity    30000  29000   59000

Why? Because it’s easier to calculate ratios and percentages when Male and Female are in separate columns!
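The reshaping can be tried on a four-row toy tibble (placeholder state names "A" and "B", illustrative values). pivot_wider() only spreads the gender values into columns. The total and ratio come from the mutate() step that follows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in long format
toy_long <- tibble(
  state      = c("A", "A", "B", "B"),
  gender     = c("Male", "Female", "Male", "Female"),
  population = c(50000L, 48000L, 30000L, 29000L)
)

toy_wide <- toy_long |>
  pivot_wider(names_from = gender, values_from = population) |>
  mutate(
    total        = Male + Female,
    gender_ratio = Male / Female * 100
  )

toy_wide
```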

Step 18: Display State Gender Table

state_gender |>
  head(5) |>
  
  # Select and rename columns for display
  select(
    State = state,
    Male,
    Female,
    Total = total,
    `Male %` = male_pct,
    `Female %` = female_pct,
    `Gender Ratio` = gender_ratio
  ) |>
  
  # Create table
  gt(rowname_col = "State") |>
  cols_align(columns = State, align = "right") |> 
  # Add header
  tab_header(
    title = md("**Gender Distribution by State**"),
    subtitle = "Top 5 Most Populous States"
  ) |>
  
  # Format population columns with commas
  fmt_number(
    columns = c(Male, Female, Total),
    decimals = 0,
    use_seps = TRUE
  ) |>
  
  # Format percentage and ratio columns
  fmt_number(
    columns = c(`Male %`, `Female %`, `Gender Ratio`),
    decimals = 2
  ) |>
  
  # Add color gradient to Gender Ratio
  # Low ratios (more females) shade toward gold, high ratios (more males) toward cyan
  # With domain c(90, 120), the white midpoint falls at 105
  data_color(
    columns = `Gender Ratio`,
    palette = c("#FFD700", "#ffffff", "#22d3ee"),
    domain = c(90, 120)
  ) |>
  
  # Highlight State column
  tab_style(
    style = cell_fill(color = "#f8f9fa"),
    locations = cells_body(columns = State)
  ) |>
  
  # Add footnote explaining Gender Ratio
  tab_footnote(
    footnote = "Gender Ratio represents males per 100 females. Values near 100 indicate balance.",
    locations = cells_column_labels(columns = `Gender Ratio`)
  ) |>
  
  # Apply pre-built theme
  gt_theme_538(quiet = TRUE) |>
  
  # Adjust font sizes
  tab_options(
    table.font.size = px(12),
    heading.title.font.size = px(16),
    footnotes.font.size = px(10)
  )

Gender Distribution by State
Top 5 Most Populous States
	Male	Female	Total	Male %	Female %	Gender Ratio¹
Jonglei	734,327	624,275	1,358,602	54.05	45.95	117.63
Central Equatoria	581,722	521,835	1,103,557	52.71	47.29	111.48
Warrap	470,734	502,194	972,928	48.38	51.62	93.74
Upper Nile	525,430	438,923	964,353	54.49	45.51	119.71
Eastern Equatoria	465,187	440,974	906,161	51.34	48.66	105.49
¹ Gender Ratio represents males per 100 females. Values near 100 indicate balance.

Understanding Gender Ratio

Gender Ratio = (Males / Females) × 100

Ratio = 100: Perfect balance (equal males and females)
Ratio > 100: More males than females
Ratio < 100: More females than males

For example:
- Ratio of 105 means 105 males per 100 females (5% more males)
- Ratio of 95 means 95 males per 100 females (5% fewer males)
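As a worked example, the formula can be applied to the Jonglei and Warrap counts from the state table. The variable names here are just for illustration.

```r
# Counts copied from the state table (Jonglei and Warrap)
male   <- c(Jonglei = 734327, Warrap = 470734)
female <- c(Jonglei = 624275, Warrap = 502194)

# Males per 100 females
gender_ratio <- male / female * 100

round(gender_ratio, 2)
```

The rounded results match the Gender Ratio column in the table: above 100 for Jonglei (more males) and below 100 for Warrap (more females).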

Step 19: Visualize State Gender Distribution

state_gender |>
  head(10) |>
  
  # Convert from wide to long format for plotting
  # Need separate rows for Male and Female to create grouped bars
  pivot_longer(
    cols = c(Male, Female),
    names_to = "gender",
    values_to = "population"
  ) |>
  
  # Reorder states by total population for better visualization
  mutate(state = fct_reorder(state, total)) |>
  
  # Create plot
  ggplot(aes(x = state, y = population, fill = gender)) +
  
  # Grouped bar chart (bars side by side)
  geom_col(position = "dodge", alpha = 0.9, width = 0.7) +
  
  # Set colors
  scale_fill_manual(
    values = c(
      "Male" = "#22d3ee",
      "Female" = "#FFD700"
    )
  ) +
  
  # Format y-axis labels (show as "100K" instead of "100000")
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
  
  # Flip coordinates (horizontal bars are easier to read)
  coord_flip() +
  
  # Add labels
  labs(
    title = "**Population by State and Gender**",
    subtitle = "Top 10 Most Populous States | South Sudan 2008 Census",
    x = NULL,  # Remove x-axis label (it says "state" which is obvious)
    y = "Population",
    fill = "Gender"
  ) +
  
  # Customize theme
  theme(
    panel.grid.major.y = element_blank(),  # Remove horizontal grid lines
    panel.grid.major.x = element_line(color = "#e5e5e5"),
    legend.position = "top"
  )

Population by state and gender for the top 10 most populous states

Choosing the Right Chart Type

Grouped Bar Chart (what we used):
- Best for: Comparing categories across groups
- Shows: Exact values for each category
- Advantage: Easy to compare Male vs Female within each state

Stacked Bar Chart (alternative):
- Best for: Showing part-to-whole relationships
- Shows: Total and composition
- Advantage: Shows total population at a glance

Why coord_flip()? Long state names are easier to read horizontally than at an angle!
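Switching between the two chart types is a one-argument change in geom_col(). The sketch below uses a made-up data frame (hypothetical names and values) and builds both versions from the same base plot so they can be compared side by side.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(
  state      = rep(c("A", "B"), each = 2),
  gender     = rep(c("Male", "Female"), times = 2),
  population = c(50, 48, 30, 29)
)

base_plot <- ggplot(toy, aes(x = state, y = population, fill = gender))

# Grouped: bars side by side, easy to compare Male vs Female
p_grouped <- base_plot + geom_col(position = "dodge") + coord_flip()

# Stacked: bars on top of each other, total length shows the state total
p_stacked <- base_plot + geom_col(position = "stack") + coord_flip()

p_grouped
p_stacked
```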

Part 6: Age Category Analysis

Step 20: Calculate National Age Distribution

age_summary <- census_filtered |>
  
  # Group by age category
  group_by(age_category) |>
  
  # Sum population for each age group
  summarise(
    population = sum(population),
    .groups = "drop"
  ) |>
  
  # Calculate percentages
  mutate(
    percentage = population / sum(population) * 100,
    percentage_label = percent(percentage / 100, accuracy = 0.01)
  ) |>
  
  # Sort by population (largest first)
  arrange(desc(population))

# Display results
age_summary

# A tibble: 7 × 4
  age_category population percentage percentage_label
  <chr>             <int>      <dbl> <chr>           
1 0-14            3659337      44.3  44.30%          
2 15-24           1628835      19.7  19.72%          
3 25-34           1234926      14.9  14.95%          
4 35-44            815517       9.87 9.87%           
5 45-54            473365       5.73 5.73%           
6 55-64            237426       2.87 2.87%           
7 65+              211084       2.56 2.56%

Step 21: Display Age Distribution Table

age_summary |>
  select(
    `Age Category` = age_category, 
    Population = population,
    Percentage = percentage_label
  ) |>
  
  # Create table
  gt() |>
  
  # Add header
  tab_header(
    title = md("**Population Distribution by Age Category**"),
    subtitle = "National Summary"
  ) |>
  
  # Format population with commas
  fmt_number(
    columns = Population,
    decimals = 0,
    use_seps = TRUE
  ) |>
  
  # Add color gradient based on population size
  data_color(
    columns = Population,
    palette = c("#000000", "#0891b2", "#22d3ee", "#FFD700")
  ) |>
  
  # Make text white on colored backgrounds
  tab_style(
    style = cell_text(color = "white", weight = "bold"),
    locations = cells_body(columns = Population)
  ) |>
  
  # Add vertical divider between columns
  gt_add_divider(columns = `Age Category`, color = "#e5e5e5") |>
  
  # Adjust font sizes
  tab_options(
    table.font.size = px(13),
    heading.title.font.size = px(16)
  )

Population Distribution by Age Category
National Summary
Age Category	Population	Percentage
0-14	3,659,337	44.30%
15-24	1,628,835	19.72%
25-34	1,234,926	14.95%
35-44	815,517	9.87%
45-54	473,365	5.73%
55-64	237,426	2.87%
65+	211,084	2.56%

Step 22: Visualize Age Distribution

age_summary |>
  
  # Reorder age categories by population for better visual ranking
  mutate(age_category = fct_reorder(age_category, population)) |>
  
  # Create plot
  ggplot(aes(x = age_category, y = population, fill = population)) +
  
  # Bar chart
  geom_col(alpha = 0.9, show.legend = FALSE) +
  
  # Add text labels showing exact population
  geom_text(
    aes(label = comma(population)),
    hjust = -0.1,  # Position slightly outside the bar
    size = 3.5,
    fontface = "bold",
    color = "#06b6d4"
  ) +
  
  # Color gradient from dark to light
  scale_fill_gradient(low = "#000000", high = "#FFD700") +
  
  # Format y-axis and add space for text labels
  scale_y_continuous(
    labels = label_number(scale = 1e-3, suffix = "K"),
    expand = expansion(mult = c(0, 0.15))  # Add 15% space on right for labels
  ) +
  
  # Horizontal bars
  coord_flip() +
  
  # Labels
  labs(
    title = "**Population Distribution by Age Category**",
    subtitle = "South Sudan 2008 Census | National Summary",
    x = NULL,
    y = "Population"
  ) +
  
  # Theme adjustments
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "#e5e5e5")
  )

Population distribution across age categories with exact counts labeled

Understanding scale_y_continuous()

expansion(mult = c(0, 0.15)) controls the padding added at each end of the y scale:

First value (0): No extra space at the lower end (the left edge, once coord_flip() is applied)
Second value (0.15): Add 15% extra space at the upper end (the right edge)

Why? To make room for our text labels showing exact population counts!

Without this, the labels would get cut off at the edge of the plot.
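A minimal sketch of the idea, on made-up data (hypothetical names and values): the labels sit just past the end of each bar, and the 15% padding at the upper end of the scale keeps them inside the plot area.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(group = c("A", "B"), value = c(100, 250))

p_labels <- ggplot(toy, aes(x = group, y = value)) +
  geom_col() +
  geom_text(aes(label = value), hjust = -0.1) +   # label just outside the bar
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.15))         # 0% below zero, 15% past the longest bar
  ) +
  coord_flip()

p_labels
```

To see the clipping problem for yourself, try removing the expand argument and compare the label on the longest bar.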

Key Insights

What the Data Tells Us

1. Population Concentration

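The five most populous states (Jonglei, Central Equatoria, Warrap, Upper Nile, and Eastern Equatoria) together hold about 5.3 million of the 8.26 million people counted, roughly 64% of the national population. This geographic concentration should inform resource allocation decisions.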
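The share held by the five most populous states can be checked directly from the totals in the Step 17 output and the national total in the overview table. The variable names are illustrative.

```r
# State totals copied from the Step 17 output (top five rows)
top_five <- c(
  "Jonglei"           = 1358602,
  "Central Equatoria" = 1103557,
  "Warrap"            = 972928,
  "Upper Nile"        = 964353,
  "Eastern Equatoria" = 906161
)

# Total Population from the overview table
national_total <- 8260490

top_five_share <- sum(top_five) / national_total

round(top_five_share * 100, 1)
```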

2. Youth Demographics

The age distribution reveals a young population: 44.3% of people are aged 0-14, and about 64% are under 25. This pattern is typical of developing nations. This “youth bulge” represents both:
- Opportunity: Large workforce potential
- Challenge: Need for education and employment infrastructure

3. Gender Balance

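Nationally, males make up 51.9% of the population. Among the five most populous states, the gender ratio ranges from 93.7 males per 100 females in Warrap to 119.7 in Upper Nile. This variation may reflect:
- Migration patterns
- Conflict impacts
- Data collection methodology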

4. Regional Disparities

Substantial population differences between states suggest the need for:
- Differentiated development strategies
- Targeted resource allocation
- Context-specific policy interventions

Conclusion

Congratulations! You’ve completed a comprehensive demographic analysis using R and the tidyverse.

What You’ve Learned

Data Skills:
✅ Loading data from URLs with read_csv()
✅ Cleaning data with tidyverse functions
✅ String splitting and parsing with str_split_i(), str_remove(), and regex
✅ Extracting information from structured text (gender from “Population - Male (Number)”)
✅ Recategorizing data with case_when() for age groups
✅ Grouping and summarizing with group_by() and summarise()
✅ Reshaping data with pivot_wider() and pivot_longer()
✅ Calculating percentages and ratios

Visualization Skills:
✅ Creating pie charts and bar charts
✅ Customizing colors and themes
✅ Adding informative labels and titles
✅ Using coord_flip() for horizontal layouts
✅ Understanding Grammar of Graphics principles

Table Skills:
✅ Building professional tables with gt
✅ Formatting numbers and percentages
✅ Adding colors and styling
✅ Creating informative footnotes

String Processing Skills:
✅ Multiple methods for text extraction (split, regex, remove, separate)
✅ Using separate_wider_delim() to split into multiple columns
✅ Using str_split_i() to extract specific pieces
✅ Conditional text transformation with case_when()
✅ Understanding when to use each method
✅ Regular expressions for pattern matching

Workflow Skills:
✅ Using the pipe operator |> for readable code
✅ Writing clear, commented code
✅ Creating reproducible analyses
✅ Structuring code in logical steps

Next Steps for Learning

Beginner:
1. Practice with different datasets
2. Try modifying the colors and themes
3. Experiment with different chart types

Intermediate:
4. Learn about purrr for functional programming
5. Explore stringr for text manipulation
6. Study lubridate for date handling

Advanced:
7. Create interactive dashboards with Shiny
8. Build custom functions and packages
9. Contribute to open-source R projects

Resources:
- R for Data Science - Free online book
- RStudio Cheatsheets - Quick references
- TidyTuesday - Weekly practice datasets

Technical Reference

Packages Used

Package	Version	Purpose
tidyverse	2.0+	Meta-package including dplyr, ggplot2, tidyr, readr
gt	0.10+	Grammar of Tables for professional tables
gtExtras	0.5+	Extended gt functionality
ggtext	0.1+	Rich text rendering in ggplot2
scales	1.3+	Scale functions for number formatting
glue	1.7+	String interpolation

Key Functions Demonstrated

Function	Package	Purpose
`read_csv()`	readr	Load CSV files
`glimpse()`	dplyr	View data structure
`select()`	dplyr	Choose columns
`filter()`	dplyr	Choose rows
`mutate()`	dplyr	Create/modify columns
`case_when()`	dplyr	Multi-condition IF statements
`group_by()`	dplyr	Group data
`summarise()`	dplyr	Calculate summaries
`pivot_wider()`	tidyr	Reshape long→wide
`pivot_longer()`	tidyr	Reshape wide→long
`separate_wider_delim()`	tidyr	Split column into multiple columns
`str_split_i()`	stringr	Split strings and extract piece
`str_remove()`	stringr	Remove text patterns
`str_extract()`	stringr	Extract text with regex
`str_squish()`	stringr	Remove extra whitespace
`ggplot()`	ggplot2	Create visualizations
`gt()`	gt	Create tables

Quarto Features Used

Code chunk labels (#| label:) for organization
Code summaries (#| code-summary:) for collapsible sections
Figure captions (#| fig-cap:) for accessibility
Code line numbers (#| code-line-numbers: true) for teaching
Callout blocks (tip, note, important) for emphasis
Cross-references (#sec-intro) for navigation
Themed output for consistent appearance

About the Author

Alierwai Reng is the Founder and Lead Educator of PyStatR+, a data science educator, and analytics leader with expertise in statistics and healthcare analytics. His mission is to make technical knowledge accessible through clear, beginner-friendly education. He believes in “Education from the Heart.”

For training, consulting, or collaboration opportunities: 📧 info@pystatrplus.org 🌐 pystatrplus.org

Editor’s Note

This tutorial reflects PyStatR+’s core philosophy: that data science education should be accessible, practical, and empowering. We believe the best learning happens when complexity is distilled into clarity—without sacrificing rigor.

At PyStatR+, we teach from the heart by putting ourselves in your shoes—because learning is a partnership, not a solitary journey.

PyStatR+: Learning Simplified. Communication Amplified.