---
title: "Demographic Patterns in South Sudan: A Tidyverse Exploration"
subtitle: "A Beginner's Guide to Data Analysis with R and the Tidyverse"
author: "Alierwai Reng"
date: "2024-11-20"
categories: [R, Tidyverse, ggplot2, Demographics, Data Visualization]
image: featured.png
description: "Learn data analysis step-by-step using R's tidyverse! This beginner-friendly tutorial explores South Sudan's demographics with clear explanations, beautiful visualizations, and professional tables."
format:
  html:
    code-fold: false
    code-tools: true
    toc: true
    toc-depth: 3
    toc-title: "Tutorial Contents"
    self-contained: true
    theme: cosmo
execute:
  warning: false
  message: false
---
## Introduction {#sec-intro}
Welcome to this hands-on data analysis tutorial! By the end of this guide, you'll understand how to:
- **Load and explore** real-world census data
- **Clean and transform** data using tidyverse functions
- **Calculate** summary statistics and group-level metrics
- **Create beautiful visualizations** with ggplot2
- **Build professional tables** with gt
We'll analyze South Sudan's 2008 census data, but the techniques you learn apply to any dataset.
::: {.callout-tip}
## What is the Tidyverse?
The **tidyverse** is a collection of R packages designed to work together seamlessly for data science. Think of it as a complete toolkit where all the tools fit perfectly in your hand. Key packages include:
- **dplyr** - data manipulation (filter, group, summarize)
- **ggplot2** - data visualization (charts and graphs)
- **tidyr** - data tidying (reshape and clean)
- **readr** - data import (read CSV, TSV, and other delimited text files)
These packages share a common design philosophy, making your code readable and your workflow intuitive!
:::
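
To make this concrete, here's a quick illustrative sketch (using ggplot2's built-in `mpg` dataset rather than our census data) of how dplyr and ggplot2 hand results to each other in one pipeline:

```r
# Minimal sketch (not part of the census analysis): dplyr summarizes,
# ggplot2 visualizes, all in one pipe
library(tidyverse)

mpg |>
  filter(year == 2008) |>                  # dplyr: keep 2008 models
  group_by(class) |>
  summarise(avg_hwy = mean(hwy)) |>        # dplyr: average highway mpg by class
  ggplot(aes(x = class, y = avg_hwy)) +    # ggplot2: visualize the summary
  geom_col()
```
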
---
## Part 1: Environment Setup {#sec-setup}
### Step 1: Load Required Packages
Every R analysis starts by loading the libraries (packages) we need. Think of packages as apps on your phone—each one adds specific capabilities.
```{r}
#| label: load-packages
#| code-summary: "Load tidyverse and visualization packages"
# Core tidyverse packages (loads dplyr, ggplot2, tidyr, readr, and more!)
library(tidyverse)
library(janitor)
# Table formatting packages
library(gt) # Grammar of Tables - for beautiful tables
library(gtExtras) # Extra features for gt tables
# Visualization enhancement packages
library(ggtext) # Rich text formatting in ggplot2
library(scales) # Scale functions for axes and labels
library(glue) # Easy string interpolation
# Confirmation message
cat("✅ All packages loaded successfully!\n")
cat("📦 Tidyverse version:", as.character(packageVersion("tidyverse")), "\n")
```
::: {.callout-note}
## Package Installation
If you don't have these packages installed, run this code **once**:
```r
install.packages(c("tidyverse", "janitor", "gt", "gtExtras", "ggtext", "scales", "glue"))
```
After installation, you only need to load them with `library()` in each new R session.
:::
### Step 2: Configure Visualization Theme
Setting a consistent theme makes all your visualizations look professional and cohesive. This is like choosing a design template for all your charts.
```{r}
#| label: setup-theme
#| code-summary: "Configure default ggplot2 theme"
# Set default theme for ALL ggplot2 visualizations
theme_set(
theme_minimal(base_size = 13, base_family = "sans") +
theme(
# Title styling - bold and colored
plot.title = element_markdown(
size = 16,
face = "bold",
color = "#06b6d4",
margin = margin(b = 10) # Space below title
),
# Subtitle styling - smaller and gray
plot.subtitle = element_markdown(
size = 12,
color = "#666666",
margin = margin(b = 15) # Space below subtitle
),
# Caption styling - small and light gray
plot.caption = element_markdown(
size = 9,
color = "#999999",
hjust = 0 # Left-aligned
),
# Grid lines
panel.grid.minor = element_blank(), # Remove minor grid lines
panel.grid.major = element_line(color = "#e5e5e5", linewidth = 0.3),
# Legend positioning and styling
legend.position = "top",
legend.title = element_text(face = "bold", size = 11),
# Axis labels
axis.title = element_text(face = "bold", size = 11)
)
)
cat("🎨 Custom theme configured!\n")
```
::: {.callout-tip}
## Why Use theme_set()?
Using `theme_set()` means you don't have to add the same theme code to every plot. Set it once at the beginning, and all your plots will look consistent!
Think of it like setting your phone's wallpaper—you set it once, and it applies everywhere.
:::
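
For instance, once `theme_set()` has run, a bare plot like the sketch below (again using the built-in `mpg` dataset, purely for illustration) picks up the custom title styling and grid settings with no extra theme code:

```r
# Illustrative only: after theme_set(), every new plot inherits the custom theme
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "#22d3ee") +
  labs(
    title = "**Engine Size vs Highway MPG**",  # rendered bold by element_markdown
    x = "Displacement (L)",
    y = "Highway MPG"
  )
```
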
---
## Part 2: Loading and Exploring Data {#sec-load}
### Step 3: Load Census Data from URL
Real datasets often live on the web. The `readr` package (part of tidyverse) can read directly from URLs—no download required!
```{r}
#| label: load-data
#| code-summary: "Load census data from GitHub"
# Define the data source URL
url <- "https://raw.githubusercontent.com/tongakuot/r_tutorials/main/06-data-wrangling/00-input/ss_2008_census_data_raw.csv"
# Read the CSV file into R
# read_csv() is from the readr package (part of tidyverse)
census_raw <- read_csv(url, show_col_types = FALSE)
# Display basic information about the dataset
cat("✅ Data loaded successfully!\n")
cat("📋 Dimensions:", nrow(census_raw), "rows ×", ncol(census_raw), "columns\n")
```
::: {.callout-note}
## read_csv() vs read.csv()
R has two main functions for reading CSV files:
- **`read.csv()`** - Base R function (slower; created factors by default before R 4.0)
- **`read_csv()`** - Tidyverse function (faster, better default behavior)
We use `read_csv()` because it's faster and works better with tidyverse workflows!
:::
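
As a tiny illustration (with made-up numbers, not our census data), `read_csv()` in readr 2.0+ can even read literal text wrapped in `I()`, and you can silence the column-type message by declaring types up front:

```r
# Illustrative sketch: inline CSV text instead of a file or URL (toy numbers)
demo <- read_csv(
  I("state,population\nJonglei,100\nUnity,200"),
  col_types = "ci"   # c = character, i = integer
)
demo
```
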
### Step 4: Examine the Data Structure
Before cleaning, we must **understand** our data. What columns exist? What types are they? Let's investigate!
```{r}
#| label: examine-structure
#| code-summary: "View data structure with glimpse()"
# glimpse() is like str() but more readable
# It shows: column names, data types, and first few values
glimpse(census_raw)
```
::: {.callout-tip}
## Understanding glimpse() Output
The output shows:
- **Rows** and **Columns** count at the top
- Each line shows: `column_name <data_type> first_few_values`
Common data types:
- `<chr>` = character (text)
- `<dbl>` = double (decimal numbers)
- `<int>` = integer (whole numbers)
- `<lgl>` = logical (TRUE/FALSE)
:::
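
Here's a quick, self-contained sketch of those types so you can see how `glimpse()` labels each one:

```r
# Illustrative tibble with one column of each common type
type_demo <- tibble(
  name  = c("Juba", "Malakal"),   # <chr> character
  ratio = c(1.05, 0.98),          # <dbl> double
  count = c(10L, 12L),            # <int> integer
  urban = c(TRUE, FALSE)          # <lgl> logical
)
glimpse(type_demo)
```
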
### Step 5: Preview the Data
Let's look at the actual data to see what we're working with:
```{r}
#| label: preview-data
#| code-summary: "View first 10 rows"
# head() shows the first n rows (default = 6, we're asking for 10)
census_raw |>
head(10)
```
::: {.callout-note}
## The Pipe Operator |>
The pipe operator `|>` means "take this and then do that". It makes code read like a sentence:
```r
# Without pipe (hard to read)
head(census_raw, 10)
# With pipe (reads left to right)
census_raw |> head(10)
```
Read it as: "Take census_raw, THEN show me the head (first rows)"
**Note:** You might also see `%>%` (magrittr pipe) in older code—they work the same way!
:::
---
## Part 3: Data Cleaning {#sec-clean}
Raw data is rarely analysis-ready. We need to clean and standardize it first!
### Step 6 (a): Clean and Transform the Dataset
We'll perform several cleaning operations in one pipeline using the pipe operator `|>`:
```{r}
#| label: clean-data
#| code-summary: "Complete data cleaning pipeline"
#| code-line-numbers: true
census_clean <- census_raw |>
# Step 1: Standardize column names
# Convert to lowercase and replace spaces with underscores
rename_with(~ str_to_lower(.) |> str_replace_all(" ", "_")) |>
# Step 2: Rename columns to meaningful names
select(
state = region_name, # Rename region_name to state
gender = variable_name, # Rename variable_name to gender
age_category = age_name, # Rename age_name to age_category
population = `2008` # Rename 2008 to population
) |>
# Step 3: Clean character (text) columns
mutate(
# Remove extra whitespace and convert to title case
state = str_squish(state) |> str_to_title(),
gender = str_squish(gender) |> str_to_title(),
age_category = str_squish(age_category)
) |>
# Step 4: Ensure population is numeric (integer)
mutate(population = as.integer(population)) |>
# Step 5: Remove rows with missing or invalid data
filter(!is.na(population), population > 0)
# Display results
cat("✅ Data cleaning complete!\n")
cat("📊 Cleaned dataset:", nrow(census_clean), "rows ×", ncol(census_clean), "columns\n")
```
### Step 6 (b): Alternative Cleaning with Janitor Package
Here's a more efficient approach using the `janitor` package:
```{r}
#| label: clean-data-janitor
#| code-summary: "Cleaning pipeline using janitor package"
#| code-line-numbers: true
janitor_df <-
census_raw |>
# Step 1: Standardize column names automatically
# Convert to lowercase and replace spaces with underscores - janitor::clean_names()
clean_names() |>
# Step 2: Rename columns to meaningful names
select(
state = region_name, # Rename region_name to state
gender = variable_name, # Rename variable_name to gender
age_category = age_name, # Rename age_name to age_category
population = x2008 # Rename x2008 to population
) |>
# Step 3: Clean character (text) columns
mutate(
across(where(is.character), \(x) str_to_title(x))
) |>
# Step 4: Ensure population is numeric (integer)
mutate(population = as.integer(population)) |>
# Step 5: Remove rows with missing or invalid data
filter(!is.na(population), population > 0)
# Display results
cat("✅ Data cleaning complete!\n")
cat("📊 Cleaned dataset:", nrow(janitor_df), "rows ×", ncol(janitor_df), "columns\n")
```
::: {.callout-important icon="true"}
## Understanding Both Cleaning Pipelines
Let's break down what each step does in both approaches:
### Step 1: Standardizing Column Names
**Method 6(a) - Manual with `rename_with()`:**
- `rename_with()` applies a function to ALL column names
- `str_to_lower()` converts to lowercase: "State" → "state"
- `str_replace_all()` replaces spaces: "Age Category" → "age_category"
- You control exactly what transformations happen
**Method 6(b) - Automatic with `clean_names()`:**
- `clean_names()` from janitor package does everything at once:
- Converts to lowercase: "Region Name" → "region_name"
- Replaces spaces with underscores: "Age Category" → "age_category"
- Handles special characters and names that start with a digit: "2008" → "x2008" (adds an 'x' prefix)
- Makes names unique and R-friendly
- One function, multiple benefits!
### Step 2: Renaming Columns with `select()`
**Both methods do this identically:**
- Chooses which columns to keep
- Renames while selecting: `new_name = old_name`
- Drops columns we don't need
- Note: In 6(b), we rename `x2008` instead of `2008` (because `clean_names()` added the 'x')
### Step 3: Cleaning Text Columns with `mutate()`
**Method 6(a) - One Column at a Time:**
```r
mutate(
state = str_squish(state) |> str_to_title(),
gender = str_squish(gender) |> str_to_title(),
age_category = str_squish(age_category)
)
```
- Each column is cleaned individually
- `str_squish()` removes extra spaces: "  South   Sudan " → "South Sudan"
- `str_to_title()` capitalizes words: "JUBA" → "Juba"
- More typing, but explicit control over each column
**Method 6(b) - All at Once with `across()`:**
```r
mutate(
across(where(is.character), \(x) str_to_title(x))
)
```
- `across()` applies a function to multiple columns at once
- `where(is.character)` automatically selects only text columns
- `\(x)` is a shorthand function (lambda) that takes each column (x)
- Applies `str_to_title()` to capitalize: "new york" → "New York"
- Less typing, works on any number of text columns!
- **Note:** This doesn't use `str_squish()`, assuming data is already clean
### Step 4: Converting Population to Integer
**Both methods use `as.integer()`:**
- Converts to whole numbers only (no decimals)
- Makes sense for population - you can't have 3.5 people!
- Uses less memory than `as.numeric()`
- Ensures data integrity for counting
**Why not `as.numeric()`?**
- `as.numeric()` allows decimals: 1234.5
- Population should be whole numbers: 1234
- `as.integer()` is the correct choice for count data
### Step 5: Filtering Out Bad Data
**Both methods use the same filter:**
- `!is.na(population)` removes rows where population is missing (NA)
- `population > 0` removes rows where population is zero or negative
- Only keeps valid, positive population counts
:::
::: {.callout-tip icon="true"}
## When to Use Each Method
**Use Method 6(a) when:**
- You need precise control over each transformation
- You're learning the fundamentals of data cleaning
- Different columns need different cleaning steps
- You want to see exactly what's happening at each step
**Use Method 6(b) when:**
- You have many similar columns to clean
- You want cleaner, more concise code
- Column names are messy and need standardization
- You're comfortable with more advanced R techniques
**The Result:**
Both methods produce equivalent, analysis-ready datasets. The only difference is that 6(b) title-cases every text column (e.g., "0 to 4" becomes "0 To 4"), while 6(a) only trims whitespace from `age_category`. The choice depends on your preference for explicit control vs. efficiency.
:::
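
If you want to reassure yourself that the two pipelines agree, a quick sanity check like the sketch below compares their dimensions and total population. Exact `identical()` equality isn't expected, because 6(b) title-cases `age_category` and 6(a) does not:

```r
# Illustrative check only, not required for the analysis
dim(census_clean)
dim(janitor_df)
sum(census_clean$population) == sum(janitor_df$population)

# age_category casing may differ (6(b) title-cases it), so compare case-insensitively
identical(
  str_to_lower(census_clean$age_category),
  str_to_lower(janitor_df$age_category)
)
```
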
### Step 7: Preview the Cleaned Data
Let's see how our cleaned data looks now:
```{r}
#| label: preview-cleaned
#| code-summary: "View cleaned data"
# View first 10 rows of cleaned data
census_clean |>
head(10)
```
::: {.callout-tip}
## Data Cleaning Checklist
When cleaning data, always:
✅ **Standardize names** - Use consistent naming (lowercase, underscores)
✅ **Remove whitespace** - Trim extra spaces that cause problems
✅ **Fix data types** - Numbers should be numeric, not text
✅ **Handle missing values** - Decide: remove, replace, or keep?
✅ **Check for duplicates** - Remove or investigate unusual patterns
Our pipeline handles all of these!
:::
---
## Part 3B: Advanced String Processing {#sec-string-processing}
Now we need to extract more meaningful information from our data. The `gender` column actually contains structured text like "Population - Male (Number)" that we can parse!
### Step 8: Examine the Gender Column Structure
Let's look at what values exist in the gender column:
```{r}
#| label: examine-gender
#| code-summary: "Explore gender column values"
# See unique values in gender column
cat("🔍 Unique values in gender column:\n")
census_clean |>
distinct(gender) |>
pull(gender)
```
::: {.callout-note}
## String Structure Analysis
The gender column follows a pattern: **"Population - Gender (Type)"**
Examples:
- "Population - Male (Number)"
- "Population - Female (Number)"
- "Population - Total (Number)"
We want to extract just the middle piece: **Male**, **Female**, or **Total**
This is called **string splitting** or **text parsing**!
:::
### Step 9: Extract Gender Information (Multiple Methods)
R provides several ways to extract information from text. Let's explore different approaches!
#### Method 1: Using str_split() with List Extraction
The simplest approach: split on a delimiter and extract the piece you want.
```{r}
#| label: gender-method-1
#| code-summary: "Method 1: Basic string splitting"
# Split on " - " and extract the second piece
method_1_example <- census_clean |>
select(original_gender = gender) |>
mutate(
# Split creates a list, str_split_i gets the i-th piece
    gender_extracted = str_split_i(original_gender, " - ", 2)
) |>
distinct()
cat("✅ Method 1: Split on ' ' and extract 2nd piece\n")
method_1_example
```
::: {.callout-tip}
## How str_split_i() Works
`str_split_i(string, pattern, i)` breaks text at a delimiter:
- **string**: The text to split
- **pattern**: What to split on (e.g., " - ")
- **i**: Which piece to extract (1 = first, 2 = second, etc.)
**Example:**
```r
"Population - Male (Number)"
→ split on " - "
→ ["Population", "Male (Number)"]
→ extract 2nd piece
→ "Male (Number)"
```
We still need to remove the "(Number)" part!
:::
#### Method 2: Chain Multiple String Operations
Sometimes you need multiple steps to clean extracted text:
```{r}
#| label: gender-method-2
#| code-summary: "Method 2: Chained string operations"
# Split, then clean: first split on " - ", then remove the parenthetical
method_2_example <- census_clean |>
select(original_gender = gender) |>
mutate(
# Step 1: Extract middle section
    temp = str_split_i(original_gender, " - ", 2),
# Step 2: Remove everything from " (" onwards
gender_clean = str_remove(temp, " \\(.*\\)")
) |>
select(-temp) |>
distinct()
cat("✅ Method 2: Chained operations for clean extraction\n")
method_2_example
```
::: {.callout-note}
## Understanding str_remove()
`str_remove(string, pattern)` deletes matching text:
- **Pattern**: ` \\(.*\\)` is a regular expression meaning:
- ` \\(` = literal space and opening parenthesis
- `.*` = any characters (zero or more)
- `\\)` = literal closing parenthesis
**Result:** Removes " (Number)" from "Male (Number)" → "Male"
:::
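
You can try the pattern on a single string to see the effect:

```r
# Quick illustration of the pattern on one value
str_remove("Male (Number)", " \\(.*\\)")
#> [1] "Male"
```
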
#### Method 3: Using Regular Expressions (Regex)
For complex patterns, regex provides powerful extraction:
```{r}
#| label: gender-method-3
#| code-summary: "Method 3: Regex extraction"
# Use regex pattern to extract text between " - " and " ("
method_3_example <- census_clean |>
select(original_gender = gender) |>
mutate(
    gender_regex = str_extract(original_gender, "(?<= - ).*(?= \\()")
) |>
distinct()
cat("✅ Method 3: Regex pattern matching\n")
method_3_example
```
::: {.callout-important}
## Understanding the Regex Pattern
`"(?<= ).*(?= \\()"` uses **lookaround assertions**:
**Pattern breakdown:**
- `(?<= )` = **Lookbehind**: Must be preceded by " - " (but don't include it)
- `.*` = **Match**: Any characters (this is what we capture)
- `(?= \\()` = **Lookahead**: Must be followed by " (" (but don't include it)
**In plain English:**
"Find text that comes after ' - ' and before ' ('"
**Example:**
"Population - Male (Number)"
- After " - ": "Male (Number)"
- Before " (": "Male"
- **Captured: "Male"** ✅
:::
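
Again, testing on one string makes the lookarounds easy to verify:

```r
# Extract the text between " - " and " ("
str_extract("Population - Male (Number)", "(?<= - ).*(?= \\()")
#> [1] "Male"
```
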
#### Method 4: Using str_remove() to Delete Unwanted Parts
Remove prefix and suffix to isolate what you need:
```{r}
#| label: gender-method-4
#| code-summary: "Method 4: Remove unwanted parts"
# Remove the prefix and suffix
method_4_example <- census_clean |>
select(original_gender = gender) |>
mutate(
gender_replace = original_gender |>
str_remove("Population, ") |> # Remove prefix
str_remove(" \\(Number\\)") # Remove suffix
) |>
distinct()
cat("✅ Method 4: String removal approach\n")
method_4_example
```
#### Method 5: Complete Solution in One Pipeline
Now let's apply the cleanest method to our actual dataset:
```{r}
#| label: gender-method-5-final
#| code-summary: "Method 5: Final production solution"
#| code-line-numbers: true
# Apply gender extraction to our dataset
census_parsed <- census_clean |>
mutate(
    # Extract gender: split on " - ", take 2nd piece, remove parenthetical
    gender = str_split_i(gender, " - ", 2) |>
str_remove(" \\(.*\\)") |>
str_squish() # Remove any extra whitespace
)
# Verify extraction worked
cat("✅ Gender extraction complete!\n")
cat("🎯 Unique gender values:\n")
census_parsed |>
distinct(gender) |>
pull(gender)
```
::: {.callout-tip}
## Which Method Should You Use?
Choose based on your needs:
| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Method 1** | Simple, consistent patterns | Fast, readable | May need cleanup |
| **Method 2** | Multi-step cleaning | Very clear logic | More verbose |
| **Method 3** | Complex patterns | Most powerful | Requires regex knowledge |
| **Method 4** | Known prefix/suffix | Intuitive | Less flexible |
| **Method 5** | Production code | Clean, efficient | Combines multiple concepts |
**For this tutorial:** Method 5 is ideal—it's clean, efficient, and production-ready!
:::
#### Method 6: Using separate_wider_delim() (Modern Tidyr Approach)
The `separate_wider_delim()` function from tidyr provides a clean, declarative way to split columns into multiple pieces. This is a modern approach that's perfect for structured text!
```{r}
#| label: gender-method-6-separate
#| code-summary: "Method 6: separate_wider_delim() from tidyr"
#| code-line-numbers: true
# Demonstrate separate_wider_delim()
method_6_example <- census_clean |>
select(original_gender = gender) |>
  # Step 1: Split the text on " - " into named pieces
separate_wider_delim(
cols = original_gender,
    delim = " - ",
names = c("prefix", "gender_raw", "suffix"),
too_few = "align_start" # Handle any rows with fewer delimiters
) |>
# Step 2: Clean the extracted gender piece
mutate(
gender_clean = str_remove(gender_raw, " \\(.*\\)") |>
str_squish()
) |>
# Show the transformation
select(prefix, gender_raw, gender_clean) |>
distinct()
cat("✅ Method 6: separate_wider_delim() approach\n")
method_6_example
```
::: {.callout-important icon="true"}
## Understanding separate_wider_delim()
`separate_wider_delim()` is designed specifically for splitting one column into multiple columns:
**Structure:**
```r
separate_wider_delim(
cols = column_to_split,
delim = "delimiter",
names = c("col1", "col2", "col3"),
too_few = "align_start" # What to do if fewer pieces than expected
)
```
**Key parameters:**
- **`cols`**: Which column to split
- **`delim`**: What to split on (our case: " - ")
- **`names`**: Names for the new columns created
- **`too_few`**: How to handle rows with fewer delimiters than expected
**What happens to our data:**
**Before:**
```
original_gender: "Population - Male (Number)"
```
**After:**
```
prefix: "Population"
gender_raw: "Male (Number)"
suffix: NA (no third piece, so too_few = "align_start" fills it with NA)
```
**Then we clean `gender_raw`** to remove "(Number)"!
:::
::: {.callout-note}
## separate_wider_delim() vs str_split_i()
**Use `separate_wider_delim()` when:**
- You want to keep **multiple pieces** from the split
- You need **named columns** for each piece
- You want **declarative, readable** code
- You're doing **data reshaping** as part of tidying
**Use `str_split_i()` when:**
- You only need **one specific piece**
- You want **fewer intermediate columns**
- You prefer a **more compact** solution
- You're doing **quick transformations**
**Example comparison:**
```r
# separate_wider_delim: Keep all pieces
data |> separate_wider_delim(col, delim = " - ", names = c("a", "b", "c"))
# Result: Three new columns (a, b, c)
# str_split_i: Extract one piece
data |> mutate(b = str_split_i(col, " - ", 2))
# Result: One new column (b)
```
**For our final dataset:** We use `str_split_i()` (Method 5) because we only need the middle piece. But `separate_wider_delim()` is excellent when you need multiple pieces!
:::
#### Comparison: All Six Methods Side by Side
Let's see all approaches and their results:
```{r}
#| label: compare-all-methods
#| code-summary: "Compare all six extraction methods"
# Create comparison table
comparison <- census_clean |>
head(3) |>
select(original = gender) |>
mutate(
    method_1_split_i = str_split_i(original, " - ", 2),
    method_2_chained = str_split_i(original, " - ", 2) |>
      str_remove(" \\(.*\\)"),
    method_3_regex = str_extract(original, "(?<= - ).*(?= \\()"),
    method_4_remove = original |>
      str_remove("Population - ") |>
      str_remove(" \\(Number\\)"),
    method_5_final = str_split_i(original, " - ", 2) |>
      str_remove(" \\(.*\\)") |>
      str_squish()
)
# Add method 6 separately (separate_wider_delim works differently)
comparison_method_6 <- census_clean |>
head(3) |>
select(original = gender) |>
separate_wider_delim(
cols = original,
    delim = " - ",
names = c("prefix", "gender_raw", "suffix"),
too_few = "align_start"
) |>
mutate(
method_6_separate = str_remove(gender_raw, " \\(.*\\)") |> str_squish()
) |>
pull(method_6_separate)
# Combine and display
comparison |>
mutate(method_6_separate = comparison_method_6) |>
gt() |>
tab_header(
title = md("**Comparison of All Six Methods**"),
subtitle = "Different approaches to extract 'Male' from 'Population - Male (Number)'"
) |>
tab_style(
style = cell_text(size = px(10)),
locations = cells_body()
) |>
tab_style(
style = cell_fill(color = "#f8f9fa"),
locations = cells_body(columns = c(method_5_final, method_6_separate))
) |>
tab_footnote(
footnote = "Methods 5 and 6 (highlighted) produce clean output ready for analysis",
locations = cells_column_labels(columns = c(method_5_final, method_6_separate))
) |>
cols_label(
original = "Original",
method_1_split_i = "M1: split_i",
method_2_chained = "M2: chained",
method_3_regex = "M3: regex",
method_4_remove = "M4: remove",
method_5_final = "M5: final",
method_6_separate = "M6: separate"
)
```
::: {.callout-tip icon="true"}
## Choosing Your Method: Decision Tree
**Start here:** What do you need?
1. **Need multiple pieces from the split?**
- Yes → Use **Method 6: `separate_wider_delim()`**
- No → Continue to #2
2. **Is the pattern very complex (multiple conditions)?**
- Yes → Use **Method 3: Regex `str_extract()`**
- No → Continue to #3
3. **Do you know exact prefix/suffix to remove?**
- Yes → Use **Method 4: `str_remove()`**
- No → Continue to #4
4. **Need just one piece from a split?**
- Yes → Use **Method 5: `str_split_i()` with cleanup** ✅ (Recommended for our case)
5. **Want to see intermediate steps for debugging?**
- Yes → Use **Method 2: Chained operations**
- No → Use **Method 5** (most efficient)
**For learning:** Try all methods!
**For production:** Use Method 5 or 6 (cleanest, most maintainable)
:::
### Step 10: Recategorize Age Groups
Now let's standardize the age categories into broader, more meaningful groups.
First, let's see what age categories we currently have:
```{r}
#| label: examine-age-categories
#| code-summary: "Explore current age categories"
cat("🔍 Current age categories:\n")
census_parsed |>
dplyr::distinct(age_category) |>
dplyr::arrange(age_category) |>
dplyr::pull(age_category)
```
::: {.callout-note}
## Why Recategorize Age Groups?
**Original data:** Fine-grained 5-year age bands (0-4, 5-9, etc.)
**Problem:**
- Too many categories for high-level analysis
- Harder to spot trends
- Difficult to compare with other datasets
**Solution:** Group into broader categories:
- **0-14**: Children
- **15-24**: Youth/Young adults
- **25-34**: Early working age
- **35-44**: Middle working age
- **45-54**: Later working age
- **55-64**: Pre-retirement
- **65+**: Retirement age
This is called **binning** or **categorization**!
:::
#### Method 1: Using case_when() for Conditional Recategorization
The `case_when()` function is perfect for complex, multi-condition transformations:
```{r}
#| label: age-method-1
#| code-summary: "Method 1: case_when() for age recategorization"
#| code-line-numbers: true
census_final <- census_parsed |>
mutate(
age_category = case_when(
# Children (0-14)
age_category %in% c("0 to 4", "5 to 9", "10 to 14") ~ "0-14",
# Youth (15-24)
age_category %in% c("15 to 19", "20 to 24") ~ "15-24",
# Early working age (25-34)
age_category %in% c("25 to 29", "30 to 34") ~ "25-34",
# Middle working age (35-44)
age_category %in% c("35 to 39", "40 to 44") ~ "35-44",
# Later working age (45-54)
age_category %in% c("45 to 49", "50 to 54") ~ "45-54",
# Pre-retirement (55-64)
age_category %in% c("55 to 59", "60 to 64") ~ "55-64",
# Retirement age (65+)
age_category == "65+" ~ "65+",
# Catch any unexpected values
TRUE ~ age_category
)
)
# Verify recategorization
cat("✅ Age categories recategorized!\n")
cat("🎯 New age categories:\n")
census_final |>
distinct(age_category) |>
arrange(age_category) |>
pull(age_category)
```
::: {.callout-important icon="true"}
## Understanding case_when()
`case_when()` is like a multi-way IF statement (similar to SQL's CASE WHEN):
**Structure:**
```r
case_when(
condition1 ~ result1, # If condition1 is TRUE, return result1
condition2 ~ result2, # Else if condition2 is TRUE, return result2
condition3 ~ result3, # Else if condition3 is TRUE, return result3
TRUE ~ default # Else return default (catch-all)
)
```
**Key points:**
- Conditions are evaluated **in order** (first match wins)
- `%in%` checks if value is in a vector (like "is one of")
- `~` separates condition from result
- `TRUE ~` at the end catches anything not matched above
**Example for our data:**
- If age is "0 to 4" OR "5 to 9" OR "10 to 14" → return "0-14"
- Else if age is "15 to 19" OR "20 to 24" → return "15-24"
- And so on...
:::
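
Here's a tiny standalone example of the same pattern on a toy vector:

```r
# Toy example: classify a few age bands with case_when()
band <- c("0 to 4", "20 to 24", "65+")
case_when(
  band %in% c("0 to 4", "5 to 9", "10 to 14") ~ "0-14",
  band %in% c("15 to 19", "20 to 24")         ~ "15-24",
  band == "65+"                               ~ "65+",
  TRUE                                        ~ band
)
#> [1] "0-14"  "15-24" "65+"
```
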
#### Method 2: Using Named Vector with Lookup (Alternative)
For simple 1-to-1 mappings, a named vector can work:
```{r}
#| label: age-method-2-demo
#| code-summary: "Method 2: Named vector mapping (for demonstration)"
# Create mapping vector
age_mapping <- c(
"0 to 4" = "0-14", "5 to 9" = "0-14",
"10 to 14" = "0-14", "15 to 19" = "15-24",
"20 to 24" = "15-24", "25 to 29" = "25-34",
"30 to 34" = "25-34", "35 to 39" = "35-44",
"40 to 44" = "35-44", "45 to 49" = "45-54",
"50 to 54" = "45-54", "55 to 59" = "55-64",
"60 to 64" = "55-64", "65+" = "65+"
)
# Demonstrate lookup (not applied to dataset)
demo_recode <- census_parsed |>
select(age_category) |>
mutate(
age_category_alt = age_mapping[age_category]
) |>
distinct()
cat("✅ Method 2 demonstration:\n")
demo_recode
```
::: {.callout-tip}
## Comparing case_when() vs Named Vector
**Use case_when() when:**
- Multiple conditions per category
- Complex logic (AND/OR operations)
- Need to explain your logic clearly
- **Best for our use case** ✅
**Use named vector when:**
- Simple 1-to-1 replacements
- Large number of mappings
- Mapping stored separately from code
Both work, but `case_when()` is more readable and maintainable for conditional logic!
:::
### Step 11: Verify Final Cleaned Dataset
#### Step 11 (a): Removing Total Rows Using `filter()`
```{r}
# Filter out rows where gender or age_category contain "Total"
# This removes aggregate/summary rows, keeping only individual demographic categories
census_filtered <- census_final |>
filter(gender != "Total", age_category != "Total")
# Display the filtered dataset
census_filtered
```
#### Step 11 (b): Alternative Filtering Approach with OR Logic
```{r}
# Alternative method: Remove "Total" values using explicit OR logic
# Useful when you want to clearly see the filtering conditions
# Note: This approach is not used in subsequent analyses - shown for demonstration only
filtered_df <- census_final |>
filter(!((gender == "Total") | (age_category == "Total")))
# Display the filtered dataset
filtered_df
```
Let's confirm our data cleaning and transformations worked correctly:
```{r}
#| label: verify-final
#| code-summary: "Verify final cleaned dataset"
# Generate a comprehensive summary of the cleaned and transformed dataset
# This verification step ensures all transformations were applied correctly
cat(strrep("=", 50), "\n", sep = "")
cat("🎉 DATA TRANSFORMATION COMPLETE!\n")
cat(strrep("=", 50), "\n", sep = "")
# Display dimensions of the final cleaned dataset
cat("📊 Final dataset dimensions:\n")
cat(" Rows:", nrow(census_filtered), "\n")
cat(" Columns:", ncol(census_filtered), "\n\n")
# List all column names in the cleaned dataset
cat("✅ Column names:\n")
cat(" ", paste(names(census_final), collapse = ", "), "\n\n")
# Show all unique gender categories (should only be "Male", "Female", and "Total" before filtering)
cat("🎯 Unique gender values:\n")
census_filtered |> distinct(gender) |> pull(gender) |> cat(" ", "\n")
# Show all unique age categories in sorted order
cat("\n🎯 Unique age categories:\n")
census_filtered |> distinct(age_category) |> arrange(age_category) |> pull(age_category) |> cat(" ", "\n")
# Display first 10 rows of key columns to visually inspect the data
cat("\n📋 Sample of final data:\n")
census_filtered |>
select(state, gender, age_category, population) |>
head(10)
```
::: {.callout-note}
## Data Transformation Summary
**What we accomplished:**
**Original `gender` column values:**
- "Population - Male (Number)"
- "Population - Female (Number)"
- "Population - Total (Number)"
**Transformed to clean format:**
- "Male"
- "Female"
- "Total"
**Original `age_category` column structure:**
- 16 individual 5-year age bands
- Examples: "0 to 4", "5 to 9", "10 to 14", ..., "65+"
**Transformed to standardized life-stage categories:**
- 7 broader, more interpretable age groups
- "0-14", "15-24", "25-34", "35-44", "45-54", "55-64", "65+"
**Result:** A clean, standardized dataset ready for analysis and visualization! ✨
:::
---
::: {.callout-tip}
## Data Cleaning Best Practices Checklist
Follow these essential steps when cleaning any dataset:
✅ **Standardize column names** - Use consistent formatting (lowercase, underscores, no spaces)
✅ **Remove unnecessary whitespace** - Trim leading/trailing spaces that cause matching errors
✅ **Ensure correct data types** - Verify numeric data is stored as numbers, not text
✅ **Address missing values** - Decide upfront: remove rows, replace with values, or keep as-is?
✅ **Identify and handle duplicates** - Remove exact duplicates or investigate patterns
✅ **Remove aggregate rows** - Filter out summary/total rows that skew analysis
**Our data pipeline addresses all of these considerations!**
:::
---
## Part 4: Data Exploration and Summary {#sec-explore}
### Step 12: Create Overview Statistics
Let's calculate some key statistics about our dataset:
```{r}
#| label: overview-stats
#| code-summary: "Calculate summary statistics"
# Create a summary table
overview_table <- census_filtered |>
summarise(
`Total Population` = comma(sum(population)), # Format with commas
`Number of States` = n_distinct(state), # Count unique states
`Age Categories` = n_distinct(age_category), # Count unique ages
`Gender Groups` = n_distinct(gender), # Count unique genders
`Total Observations` = comma(n()) # Count all rows
)
# Display the summary
overview_table
```
::: {.callout-note}
## Understanding summarise()
`summarise()` collapses data into summary statistics:
- `sum()` - adds up values
- `n_distinct()` - counts unique values
- `n()` - counts total rows
- `comma()` - formats numbers with commas (from scales package)
It reduces many rows into one row of summaries!
:::
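
As a standalone illustration with toy numbers:

```r
# Toy data: summarise() collapses three rows into one summary row
toy <- tibble(state = c("A", "A", "B"), population = c(10, 20, 30))
toy |>
  summarise(
    total_population = sum(population),   # 60
    n_states         = n_distinct(state), # 2
    n_rows           = n()                # 3
  )
```
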
### Step 13: Display as Professional Table
Now let's make this summary look professional using the `gt` package:
```{r}
#| label: overview-table-styled
#| code-summary: "Create styled table with gt"
overview_table |>
gt() |>
tab_header(
title = md("**South Sudan 2008 Census Overview**"),
subtitle = "Key Summary Statistics"
) |>
tab_style(
style = cell_fill(color = "#22d3ee"),
locations = cells_body()
) |>
tab_style(
style = cell_text(color = "white", weight = "bold"),
locations = cells_body()
) |>
tab_options(
table.font.size = px(13),
heading.title.font.size = px(16),
heading.subtitle.font.size = px(12)
)
```
::: {.callout-tip icon="true"}
## The Grammar of Tables (gt)
The `gt` package uses a **layered approach** (like ggplot2 for tables):
1. **Start with data** → `gt()`
2. **Add headers** → `tab_header()`
3. **Style cells** → `tab_style()`
4. **Format numbers** → `fmt_number()`
5. **Adjust options** → `tab_options()`
Each layer adds or modifies the table appearance!
:::
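
Here's the same layered idea on a two-row toy table, so you can see each layer in isolation:

```r
# Minimal gt sketch on toy data (illustrative only)
tibble(state = c("A", "B"), population = c(12345, 67890)) |>
  gt() |>
  tab_header(
    title = md("**A Tiny gt Table**"),
    subtitle = "Toy data, two rows"
  ) |>
  fmt_number(columns = population, decimals = 0, use_seps = TRUE) |>
  tab_options(table.font.size = px(13))
```
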
---
## Part 5: Gender Analysis {#sec-gender}
### Step 14: Calculate National Gender Distribution
Let's analyze how the population is distributed by gender:
```{r}
#| label: gender-summary
#| code-summary: "Calculate gender distribution"
#| code-line-numbers: true
gender_summary <- census_filtered |>
# Step 1: Group data by gender
group_by(gender) |>
# Step 2: Calculate total population for each gender
summarise(
population = sum(population),
.groups = "drop" # Remove grouping after summarise
) |>
# Step 3: Calculate percentages
mutate(
percentage = population / sum(population) * 100,
percentage_label = percent(percentage / 100, accuracy = 0.01)
) |>
# Step 4: Sort by population (largest first)
arrange(desc(population))
# Display the results
gender_summary
```
::: {.callout-important}
## Understanding group_by() and summarise()
These two functions work together like a team:
**`group_by(gender)`**
- Splits data into groups (one for Male, one for Female)
- Like separating cards into piles
**`summarise(population = sum(population))`**
- Performs calculations within each group
- `sum()` adds up all population values in each group
- Like counting cards in each pile
**`.groups = "drop"`**
- Removes the grouping after we're done
- Prevents unexpected behavior in future operations
**Final result:** One row per gender with total population!
:::
### Step 15: Display Gender Table
```{r}
#| label: gender-table
#| code-summary: "Create styled gender distribution table"
gender_summary |>
# Rename columns for display
select(
Gender = gender,
Population = population,
`Percentage` = percentage_label
) |>
# Create gt table
gt() |>
# Add title and subtitle
tab_header(
title = md("**National Gender Distribution**"),
subtitle = "South Sudan 2008 Census"
) |>
# Format population with commas
fmt_number(
columns = Population,
decimals = 0,
use_seps = TRUE
) |>
# Style Male row (first row) - cyan background
tab_style(
style = list(
cell_fill(color = "#22d3ee"),
cell_text(color = "white", weight = "bold")
),
locations = cells_body(rows = 1)
) |>
# Style Female row (second row) - gold background
tab_style(
style = list(
cell_fill(color = "#FFD700"),
cell_text(color = "#000000", weight = "bold")
),
locations = cells_body(rows = 2)
) |>
# Center all columns
cols_align(align = "center", columns = everything()) |>
# Adjust font sizes
tab_options(
table.font.size = px(13),
heading.title.font.size = px(16)
)
```
### Step 16: Visualize Gender Distribution
Numbers are great, but visualizations make patterns instantly clear. Let's create a pie chart:
```{r}
#| label: gender-viz
#| code-summary: "Create pie chart for gender distribution"
#| fig-width: 10
#| fig-height: 6
#| fig-cap: "Gender distribution shown as a pie chart with population counts and percentages"
ggplot(gender_summary, aes(x = "", y = population, fill = gender)) +
# Create a bar chart (we'll turn it into a pie)
geom_col(width = 1, color = "white", linewidth = 2) +
# Convert bar chart to pie chart using polar coordinates
coord_polar(theta = "y") +
# Set custom colors for Male and Female
scale_fill_manual(
values = c(
"Male" = "#22d3ee",
"Female" = "#FFD700"
)
) +
# Add labels showing counts and percentages
geom_text(
aes(label = glue("{comma(population)}\n({percentage_label})")),
position = position_stack(vjust = 0.5), # Center in each slice
size = 5,
fontface = "bold",
color = "white"
) +
# Add titles and labels
labs(
title = "**Gender Distribution in South Sudan**",
subtitle = "2008 Census Data",
fill = "Gender"
) +
# Use void theme for pie charts (removes axes)
theme_void() +
# Customize title and legend
theme(
plot.title = element_markdown(
size = 16,
face = "bold",
color = "#06b6d4",
hjust = 0.5, # Center title
margin = margin(b = 5)
),
plot.subtitle = element_markdown(
size = 12,
color = "#666666",
hjust = 0.5 # Center subtitle
),
legend.position = "bottom",
legend.title = element_text(face = "bold", size = 11)
)
```
::: {.callout-tip}
## Anatomy of a ggplot2 Chart
Every ggplot2 visualization follows this pattern:
1. **Start with data** → `ggplot(data, aes(...))`
2. **Add geometry** → `geom_*()` (point, line, bar, etc.)
3. **Adjust scales** → `scale_*()` (colors, axes, etc.)
4. **Add labels** → `labs()` (title, axes, etc.)
5. **Apply theme** → `theme_*()` (appearance)
Think of it like building with LEGO blocks—each layer adds something!
**Bonus:** `coord_polar()` transforms rectangular plots into circular ones (bar chart → pie chart)!
:::
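
Those five layers, written out on the built-in `mpg` dataset, look like this:

```r
# Each ggplot2 layer spelled out on the built-in mpg dataset (illustrative)
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +  # 1. data + aesthetics
  geom_point(alpha = 0.7) +                            # 2. geometry
  scale_color_viridis_d() +                            # 3. scales
  labs(title = "Engine Size vs Highway MPG",           # 4. labels
       x = "Displacement (L)", y = "Highway MPG") +
  theme_minimal()                                      # 5. theme
```
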
### Step 17: Gender Distribution by State
Now let's see how gender distribution varies across different states:
```{r}
#| label: state-gender-analysis
#| code-summary: "Calculate state-level gender statistics"
state_gender <- census_filtered |>
# Group by both state AND gender
group_by(state, gender) |>
# Sum population within each state-gender combination
summarise(population = sum(population), .groups = "drop") |>
# Reshape from long to wide format
# Before: Multiple rows per state (one for Male, one for Female)
# After: One row per state (Male and Female as separate columns)
pivot_wider(names_from = gender, values_from = population) |>
# Calculate additional metrics
mutate(
total = Male + Female, # Total population
male_pct = Male / total * 100, # Male percentage
female_pct = Female / total * 100, # Female percentage
gender_ratio = Male / Female * 100 # Males per 100 females
) |>
# Sort by total population (largest first)
arrange(desc(total))
# Display top 5 states
state_gender |>
head(5)
```
::: {.callout-note}
## Understanding pivot_wider()
`pivot_wider()` reshapes data from **long** to **wide** format:
**Before (Long format):**
```
State Gender Population
Juba Male 50000
Juba Female 48000
Unity Male 30000
Unity Female 29000
```
**After (Wide format):**
```
State Male Female Total
Juba 50000 48000 98000
Unity 30000 29000 59000
```
Why? Because it's easier to calculate ratios and percentages when Male and Female are in separate columns!
:::
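
The same reshape on a toy tibble:

```r
# Toy long-format data reshaped to wide format
long_demo <- tibble(
  state      = c("Juba", "Juba", "Unity", "Unity"),
  gender     = c("Male", "Female", "Male", "Female"),
  population = c(50000, 48000, 30000, 29000)
)
long_demo |>
  pivot_wider(names_from = gender, values_from = population) |>
  mutate(total = Male + Female)
```
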
### Step 18: Display State Gender Table
```{r}
#| label: state-gender-table
#| code-summary: "Create styled state gender table"
state_gender |>
head(5) |>
# Select and rename columns for display
select(
State = state,
Male,
Female,
Total = total,
`Male %` = male_pct,
`Female %` = female_pct,
`Gender Ratio` = gender_ratio
) |>
# Create table
gt(rowname_col = "State") |>
cols_align(columns = State, align = "right") |>
# Add header
tab_header(
title = md("**Gender Distribution by State**"),
subtitle = "Top 10 Most Populous States"
) |>
# Format population columns with commas
fmt_number(
columns = c(Male, Female, Total),
decimals = 0,
use_seps = TRUE
) |>
# Format percentage and ratio columns
fmt_number(
columns = c(`Male %`, `Female %`, `Gender Ratio`),
decimals = 2
) |>
# Add color gradient to Gender Ratio
# Values near 100 are balanced (white)
# Values far from 100 show imbalance (colored)
data_color(
columns = `Gender Ratio`,
palette = c("#FFD700", "#ffffff", "#22d3ee"),
domain = c(90, 120)
) |>
# Highlight State column
tab_style(
style = cell_fill(color = "#f8f9fa"),
locations = cells_body(columns = State)
) |>
# Add footnote explaining Gender Ratio
tab_footnote(
footnote = "Gender Ratio represents males per 100 females. Values near 100 indicate balance.",
locations = cells_column_labels(columns = `Gender Ratio`)
) |>
# Apply pre-built theme
gt_theme_538(quiet = TRUE) |>
# Adjust font sizes
tab_options(
table.font.size = px(12),
heading.title.font.size = px(16),
footnotes.font.size = px(10)
)
```
::: {.callout-important}
## Understanding Gender Ratio
**Gender Ratio** = (Males / Females) × 100
- **Ratio = 100**: Perfect balance (equal males and females)
- **Ratio > 100**: More males than females
- **Ratio < 100**: More females than males
For example:
- Ratio of 105 means 105 males per 100 females (5% more males)
- Ratio of 95 means 95 males per 100 females (5% fewer males)
:::
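
In code, that's a one-liner (with illustrative counts of 50,000 males and 48,000 females):

```r
# Gender ratio = males per 100 females (toy numbers)
round(50000 / 48000 * 100, 1)
#> [1] 104.2
```
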
### Step 19: Visualize State Gender Distribution
```{r}
#| label: state-gender-viz
#| code-summary: "Create grouped bar chart by state and gender"
#| fig-width: 12
#| fig-height: 8
#| fig-cap: "Population by state and gender for the top 10 most populous states"
state_gender |>
head(10) |>
# Convert from wide to long format for plotting
# Need separate rows for Male and Female to create grouped bars
pivot_longer(
cols = c(Male, Female),
names_to = "gender",
values_to = "population"
) |>
# Reorder states by total population for better visualization
mutate(state = fct_reorder(state, total)) |>
# Create plot
ggplot(aes(x = state, y = population, fill = gender)) +
# Grouped bar chart (bars side by side)
geom_col(position = "dodge", alpha = 0.9, width = 0.7) +
# Set colors
scale_fill_manual(
values = c(
"Male" = "#22d3ee",
"Female" = "#FFD700"
)
) +
# Format y-axis labels (show as "100K" instead of "100000")
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
# Flip coordinates (horizontal bars are easier to read)
coord_flip() +
# Add labels
labs(
title = "**Population by State and Gender**",
subtitle = "Top 10 Most Populous States | South Sudan 2008 Census",
x = NULL, # Remove x-axis label (it says "state" which is obvious)
y = "Population",
fill = "Gender"
) +
# Customize theme
theme(
panel.grid.major.y = element_blank(), # Remove horizontal grid lines
panel.grid.major.x = element_line(color = "#e5e5e5"),
legend.position = "top"
)
```
::: {.callout-tip}
## Choosing the Right Chart Type
**Grouped Bar Chart** (what we used):
- Best for: Comparing categories across groups
- Shows: Exact values for each category
- Advantage: Easy to compare Male vs Female within each state
**Stacked Bar Chart** (alternative):
- Best for: Showing part-to-whole relationships
- Shows: Total and composition
- Advantage: Shows total population at a glance
**Why coord_flip()?**
Long state names are easier to read horizontally than at an angle!
:::
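
If you'd rather emphasize totals, the stacked alternative mentioned above only requires swapping `position = "dodge"` for `position = "stack"` (the default for `geom_col()`). A quick sketch:

```r
# Sketch of the stacked alternative: totals at a glance
state_gender |>
  head(10) |>
  pivot_longer(c(Male, Female), names_to = "gender", values_to = "population") |>
  mutate(state = fct_reorder(state, total)) |>
  ggplot(aes(x = state, y = population, fill = gender)) +
  geom_col(position = "stack", alpha = 0.9, width = 0.7) +
  scale_fill_manual(values = c("Male" = "#22d3ee", "Female" = "#FFD700")) +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
  coord_flip() +
  labs(title = "Population by State (Stacked)", x = NULL, y = "Population", fill = "Gender")
```
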
---
## Part 6: Age Category Analysis {#sec-age}
### Step 20: Calculate National Age Distribution
```{r}
#| label: age-summary
#| code-summary: "Calculate age category distribution"
age_summary <- census_filtered |>
# Group by age category
group_by(age_category) |>
# Sum population for each age group
summarise(
population = sum(population),
.groups = "drop"
) |>
# Calculate percentages
mutate(
percentage = population / sum(population) * 100,
percentage_label = percent(percentage / 100, accuracy = 0.01)
) |>
# Sort by population (largest first)
arrange(desc(population))
# Display results
age_summary
```
### Step 21: Display Age Distribution Table
```{r}
#| label: age-table
#| code-summary: "Create styled age distribution table"
age_summary |>
select(
`Age Category` = age_category,
Population = population,
Percentage = percentage_label
) |>
# Create table
gt() |>
# Add header
tab_header(
title = md("**Population Distribution by Age Category**"),
subtitle = "National Summary"
) |>
# Format population with commas
fmt_number(
columns = Population,
decimals = 0,
use_seps = TRUE
) |>
# Add color gradient based on population size
data_color(
columns = Population,
palette = c("#000000", "#0891b2", "#22d3ee", "#FFD700")
) |>
# Make text white on colored backgrounds
tab_style(
style = cell_text(color = "white", weight = "bold"),
locations = cells_body(columns = Population)
) |>
# Add vertical divider between columns
gt_add_divider(columns = `Age Category`, color = "#e5e5e5") |>
# Adjust font sizes
tab_options(
table.font.size = px(13),
heading.title.font.size = px(16)
)
```
### Step 22: Visualize Age Distribution
```{r}
#| label: age-viz
#| code-summary: "Create horizontal bar chart for age distribution"
#| fig-width: 12
#| fig-height: 6
#| fig-cap: "Population distribution across age categories with exact counts labeled"
age_summary |>
# Reorder age categories by population for better visual ranking
mutate(age_category = fct_reorder(age_category, population)) |>
# Create plot
ggplot(aes(x = age_category, y = population, fill = population)) +
# Bar chart
geom_col(alpha = 0.9, show.legend = FALSE) +
# Add text labels showing exact population
geom_text(
aes(label = comma(population)),
hjust = -0.1, # Position slightly outside the bar
size = 3.5,
fontface = "bold",
color = "#06b6d4"
) +
# Color gradient from dark to light
scale_fill_gradient(low = "#000000", high = "#FFD700") +
# Format y-axis and add space for text labels
scale_y_continuous(
labels = label_number(scale = 1e-3, suffix = "K"),
expand = expansion(mult = c(0, 0.15)) # Add 15% space on right for labels
) +
# Horizontal bars
coord_flip() +
# Labels
labs(
title = "**Population Distribution by Age Category**",
subtitle = "South Sudan 2008 Census | National Summary",
x = NULL,
y = "Population"
) +
# Theme adjustments
theme(
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_line(color = "#e5e5e5")
)
```
::: {.callout-note}
## Understanding scale_y_continuous()
**`expansion(mult = c(0, 0.15))`** controls space around the plot:
- First value (0): No extra space on the left
- Second value (0.15): Add 15% extra space on the right
Why? To make room for our text labels showing exact population counts!
Without this, the labels would get cut off at the edge of the plot.
:::
---
## Key Insights {#sec-insights}
::: {.callout-important icon="true"}
## What the Data Tells Us
### 1. **Population Concentration**
The top five states account for a significant portion of the national population. This geographic concentration should inform resource allocation decisions.
### 2. **Youth Demographics**
The age distribution reveals a **young population**—typical of developing nations. This "youth bulge" represents both:
- **Opportunity**: Large workforce potential
- **Challenge**: Need for education and employment infrastructure
### 3. **Gender Balance**
Most states show **relatively balanced** gender distributions, with some variation that may reflect:
- Migration patterns
- Conflict impacts
- Data collection methodology
### 4. **Regional Disparities**
Substantial population differences between states suggest the need for:
- Differentiated development strategies
- Targeted resource allocation
- Context-specific policy interventions
:::
---
## Conclusion {#sec-conclusion}
Congratulations! You've completed a comprehensive demographic analysis using R and the tidyverse.
::: {.callout-tip icon="true"}
## What You've Learned
**Data Skills:**
✅ Loading data from URLs with `read_csv()`
✅ Cleaning data with tidyverse functions
✅ **String splitting and parsing** with `str_split_i()`, `str_remove()`, and regex
✅ **Extracting information from structured text** (gender from "Population - Male (Number)")
✅ **Recategorizing data** with `case_when()` for age groups
✅ Grouping and summarizing with `group_by()` and `summarise()`
✅ Reshaping data with `pivot_wider()` and `pivot_longer()`
✅ Calculating percentages and ratios
**Visualization Skills:**
✅ Creating pie charts and bar charts
✅ Customizing colors and themes
✅ Adding informative labels and titles
✅ Using `coord_flip()` for horizontal layouts
✅ Understanding Grammar of Graphics principles
**Table Skills:**
✅ Building professional tables with gt
✅ Formatting numbers and percentages
✅ Adding colors and styling
✅ Creating informative footnotes
**String Processing Skills:**
✅ Multiple methods for text extraction (split, regex, remove, separate)
✅ Using `separate_wider_delim()` to split into multiple columns
✅ Using `str_split_i()` to extract specific pieces
✅ Conditional text transformation with `case_when()`
✅ Understanding when to use each method
✅ Regular expressions for pattern matching
**Workflow Skills:**
✅ Using the pipe operator `|>` for readable code
✅ Writing clear, commented code
✅ Creating reproducible analyses
✅ Structuring code in logical steps
:::
::: {.callout-note}
## Next Steps for Learning
**Beginner:**
1. Practice with different datasets
2. Try modifying the colors and themes
3. Experiment with different chart types
**Intermediate:**
4. Learn about `purrr` for functional programming
5. Explore `stringr` for text manipulation
6. Study `lubridate` for date handling
**Advanced:**
7. Create interactive dashboards with Shiny
8. Build custom functions and packages
9. Contribute to open-source R projects
**Resources:**
- [R for Data Science](https://r4ds.hadley.nz/) - Free online book
- [RStudio Cheatsheets](https://posit.co/resources/cheatsheets/) - Quick references
- [TidyTuesday](https://github.com/rfordatascience/tidytuesday) - Weekly practice datasets
:::
---
## Technical Reference {#sec-reference}
### Packages Used
| Package | Version | Purpose |
|---------|---------|---------|
| **tidyverse** | 2.0+ | Meta-package including dplyr, ggplot2, tidyr, readr |
| **janitor** | 2.0+ | Data cleaning helpers such as `clean_names()` |
| **gt** | 0.10+ | Grammar of Tables for professional tables |
| **gtExtras** | 0.5+ | Extended gt functionality |
| **ggtext** | 0.1+ | Rich text rendering in ggplot2 |
| **scales** | 1.3+ | Scale functions for number formatting |
| **glue** | 1.7+ | String interpolation |
### Key Functions Demonstrated
| Function | Package | Purpose |
|----------|---------|---------|
| `read_csv()` | readr | Load CSV files |
| `clean_names()` | janitor | Standardize column names |
| `glimpse()` | dplyr | View data structure |
| `select()` | dplyr | Choose columns |
| `filter()` | dplyr | Choose rows |
| `mutate()` | dplyr | Create/modify columns |
| `case_when()` | dplyr | Multi-condition IF statements |
| `group_by()` | dplyr | Group data |
| `summarise()` | dplyr | Calculate summaries |
| `pivot_wider()` | tidyr | Reshape long→wide |
| `pivot_longer()` | tidyr | Reshape wide→long |
| `separate_wider_delim()` | tidyr | Split column into multiple columns |
| `str_split_i()` | stringr | Split strings and extract piece |
| `str_remove()` | stringr | Remove text patterns |
| `str_extract()` | stringr | Extract text with regex |
| `str_squish()` | stringr | Remove extra whitespace |
| `ggplot()` | ggplot2 | Create visualizations |
| `gt()` | gt | Create tables |
### Quarto Features Used
- **Code chunk labels** (`#| label:`) for organization
- **Code summaries** (`#| code-summary:`) for collapsible sections
- **Figure captions** (`#| fig-cap:`) for accessibility
- **Code line numbers** (`#| code-line-numbers: true`) for teaching
- **Callout blocks** (tip, note, important) for emphasis
- **Cross-references** (`#sec-intro`) for navigation
- **Themed output** for consistent appearance
---
::: {.callout-tip icon="false"}
## About the Author
Alierwai Reng is the Founder and Lead Educator of PyStatR+, a data science educator, and an analytics leader with expertise in statistics and healthcare analytics. His mission is to make technical knowledge accessible through clear, beginner-friendly education. He believes in "Education from the Heart."
**For training, consulting, or collaboration opportunities:**
📧 info@pystatrplus.org
🌐 [pystatrplus.org](https://pystatrplus.org)
:::
::: {.callout-note icon="false"}
## Editor's Note
This tutorial reflects PyStatR+'s core philosophy: that data science education should be accessible, practical, and empowering. We believe the best learning happens when complexity is distilled into clarity—without sacrificing rigor.
At PyStatR+, we teach from the heart by putting ourselves in your shoes—because learning is a partnership, not a solitary journey.
**PyStatR+**: *Learning Simplified. Communication Amplified.*
:::