---
title: "Demographic Explorer: South Sudan 2008 Census"
subtitle: "A Beginner's Guide to Data Analysis with Polars and plotnine"
author: "Alierwai Reng"
date: "2024-11-20"
categories: [Python, Polars, Data Visualization, Demographics, Census]
image: featured.png
description: "Learn data analysis step-by-step! This beginner-friendly tutorial explores South Sudan's demographics using Polars for fast data processing and plotnine for publication-quality visualizations."
format:
  html:
    code-fold: false
    code-tools: true
    toc: true
    toc-depth: 3
    toc-title: "Tutorial Contents"
    self-contained: true
---
## Introduction
Welcome to this hands-on data analysis tutorial! By the end of this guide, you'll understand how to:
- **Load and explore** real-world census data
- **Clean data** using Polars expressions (fast and efficient)
- **Split and transform strings** to extract meaningful information
- **Calculate** summary statistics and group-level metrics
- **Create beautiful visualizations** with plotnine (ggplot2 for Python)
We'll analyze South Sudan's 2008 census data, but the techniques you learn apply to any dataset.
::: {.callout-tip}
## What is Polars?
Polars is a blazingly fast DataFrame library built in Rust with Python bindings. It's designed for speed and memory efficiency, often outperforming pandas by 5-10x on large datasets. Polars uses a **query optimization engine** and **lazy evaluation** to make your data processing incredibly fast!
:::
::: {.callout-tip}
## What is plotnine?
Plotnine is a Python implementation of R's ggplot2, using the **Grammar of Graphics**. It provides a declarative, layered approach to creating visualizations, making it easy to build complex plots with readable code. Perfect for data scientists familiar with R or those who want reproducible, publication-quality graphics!
:::
---
## Part 1: Environment Setup
### Step 1: Import Required Libraries
Every Python analysis starts with importing the tools we need. Think of libraries as specialized toolboxes—each designed for specific tasks.
```{python}
#| label: load-packages
#| message: false
#| warning: false
# Data manipulation
import polars as pl # The blazingly fast DataFrame library
import polars.selectors as cs # Column selectors for easy column operations
# Visualization
from plotnine import * # Grammar of graphics (ggplot2 for Python)
import plotnine as p9
# Tables
from great_tables import GT # Professional table formatting
import great_tables as gt
print("✅ All libraries loaded successfully!")
print(f"1) 📊 Polars version: {pl.__version__};\n2) plotnine version: {p9.__version__};\n3) Great Tables version: {gt.__version__}")
```
::: {.callout-note}
## Library Purposes
| Library | What It Does |
|---------|--------------|
| **polars** | Fast, memory-efficient data manipulation with Rust backend |
| **polars.selectors** | Easy column selection (similar to tidyselect in R) |
| **plotnine** | Grammar of graphics visualization (ggplot2 port for Python) |
| **great_tables** | Makes publication-quality tables |
:::
### Step 2: Configure Visualization Theme
Consistent styling makes your visualizations look professional. We'll create a PyStatR+ branded theme for plotnine.
```{python}
#| label: setup-theme
# Define custom theme for plotnine
pystatr_theme = (
theme_minimal() +
theme(
text=element_text(family="sans-serif", size=12),
plot_title=element_text(size=16, weight="bold", color="#06b6d4"),
plot_subtitle=element_text(size=12, color="#666666"),
axis_text=element_text(size=10),
legend_position="top",
panel_grid_major=element_line(color="#e5e5e5"),
panel_grid_minor=element_blank(),
plot_background=element_rect(fill="white"),
panel_background=element_rect(fill="white")
)
)
# Define custom color palette
pystatr_colors = ["#22d3ee", "#FFD700", "#06b6d4", "#0891b2", "#0e7490"]
print("🎨 Custom theme configured!")
```
---
## Part 2: Loading the Data
### Step 3: Load Census Data from URL
Real datasets often live on the web. Polars can read directly from URLs—no download required!
```{python}
#| label: load-data
# Define the data source URL
data_url = (
"https://raw.githubusercontent.com/tongakuot/r_tutorials/refs/heads/main/00-input/ss_2008_census_data_raw.csv"
)
# Load the data into a DataFrame
df_raw = pl.read_csv(
data_url,
schema_overrides={"2008": pl.Utf8}, # force string
null_values=["", "NA", "N/A", "null"]
)
# Confirm successful load with basic info
print("✅ Data loaded successfully!")
print(f"📋 Rows: {df_raw.shape[0]:,}")
print(f"📋 Columns: {df_raw.shape[1]}")
```
::: {.callout-tip}
## Eager vs Lazy Loading
Polars also supports **lazy evaluation** via `scan_csv()`. For this tutorial, we use `read_csv()`, which loads data immediately (eager execution). Both approaches work, but lazy evaluation is more efficient for large datasets!
:::
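If you'd like to try the lazy route, here is a minimal sketch (not executed here, and assuming you have saved a hypothetical local copy of the CSV as `ss_census.csv`):

```{python}
#| label: lazy-loading-sketch
#| eval: false
# Lazy loading: scan_csv() builds a query plan instead of reading data.
# Nothing is loaded until collect() runs the whole optimized pipeline.
lazy_result = (
    pl.scan_csv("ss_census.csv")              # hypothetical local copy
    .filter(pl.col("2008").is_not_null())     # pushed down by the optimizer
    .collect()                                # execute the plan
)
```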
---
## Part 3: Data Exploration
Before cleaning, we must **understand** our data. What columns exist? What types are they? Are there issues?
### Step 4: Examine Column Names
```{python}
#| label: examine-columns
# View the original column names and types
print(f"📝 Column names and types: {df_raw.schema}")
```
Notice any issues? Inconsistent capitalization, spaces, or special characters make columns harder to work with.
### Step 5: Check Data Types
```{python}
#| label: check-dtypes
# Polars data types
print("🔍 Data types:")
for col, dtype in df_raw.schema.items():
print(f" {col}: {dtype}")
```
::: {.callout-note}
## Understanding Polars Data Types
| Type | Meaning | Examples |
|------|---------|----------|
| `Int64` | 64-bit integers | 1, 42, 1000 |
| `Float64` | 64-bit floats | 3.14, 99.9 |
| `Utf8` | String (text) | "Male", "Juba" |
| `Date` | Calendar dates | 2024-01-01 |
:::
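To see these types in action, here is a tiny toy example (separate from our census data):

```{python}
#| label: dtype-demo
# A toy Series starts as Utf8 (string) and can be cast to a numeric type
s = pl.Series("example", ["1", "2", "3"])
print(s.dtype)                  # Utf8 (string)
print(s.cast(pl.Int64).dtype)   # Int64
```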
### Step 6: Preview the Data
```{python}
#| label: preview-data
# View the first 5 rows
print(df_raw.head())
```
### Step 7: Check for Missing Values
```{python}
#| label: check-missing
# Count missing values in each column
missing_counts = df_raw.null_count()
print("❓ Missing values per column:")
print(missing_counts)
```
---
## Part 4: Data Cleaning with Polars Expressions
Now we transform our raw data into analysis-ready format using **Polars expressions**.
### Step 8: Clean Column Names
Polars offers multiple ways to rename columns. Let's explore different approaches for different use cases.
#### Method 1: Dictionary Mapping (Rename Specific Columns)
```{python}
#| label: rename-specific-columns
# Example: Rename specific columns using a dictionary
# This is useful when you want to rename only certain columns
method_1_df = df_raw.rename({
"Region Name": "state",
"Variable Name": "gender",
"Age Name": "age_category",
"2008": "population"
})
print("✅ Method 1: Dictionary mapping for specific columns")
print("📝 New column names:")
print(method_1_df.columns)
```
#### Method 2: Dictionary Comprehension (All Columns)
```{python}
#| label: rename-all-columns
# Clean all column names: lowercase and replace spaces with underscores
# This applies the same transformation to ALL columns
df = df_raw.rename(
{col: col.lower().replace(" ", "_") for col in df_raw.columns}
)
print("✅ Method 2: Transform all column names with comprehension")
print("📝 New column names:")
print(df.columns)
```
#### Method 3: Using Lambda Function
```{python}
#| label: rename-with-lambda
# Alternative: Use a lambda function for complex transformations
method_3_df = df_raw.rename(
lambda col: col.strip().lower().replace(" ", "_").replace("-", "")
)
print("✅ Method 3: Lambda function for complex cleaning")
print("📝 New column names:")
print(method_3_df.columns)
```
::: {.callout-note}
## Polars Column Renaming Options
**Method 1: Dictionary with specific columns**
- Best for: Renaming a few specific columns
- Syntax: `df.rename({"old_name": "new_name"})`
- Leaves other columns unchanged
**Method 2: Dictionary comprehension**
- Best for: Applying same rule to all columns
- Syntax: `df.rename({col: transform(col) for col in df.columns})`
- Most flexible and commonly used
**Method 3: Lambda function**
- Best for: Complex transformations
- Syntax: `df.rename(lambda col: transform(col))`
- Clean and concise
**Comparison with Pandas:**
```python
# Pandas
df.rename(columns={"old": "new"}) # Note: columns= parameter
# Polars
df.rename({"old": "new"}) # No columns= parameter needed
```
:::
::: {.callout-tip}
## Common Column Cleaning Patterns
**Snake case (recommended):**
```python
col.lower().replace(" ", "_")
# "My Column" → "my_column"
```
**Remove special characters:**
```python
import re
re.sub(r'[^a-z0-9_]', '', col.lower().replace(" ", "_"))
# "My-Column!" → "my_column"
```
**Pascal case (upper camel case):**
```python
''.join(word.capitalize() for word in col.split())
# "my column" → "MyColumn"
```
:::
::: {.callout-tip}
## Polars Expression Pattern
Polars uses a different syntax than pandas:
- **Pandas**: `df['column'].method()`
- **Polars**: `pl.col('column').method()`
This allows Polars to optimize queries before execution!
:::
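Here is a quick toy demonstration of the expression pattern (not part of the census pipeline):

```{python}
#| label: expression-pattern-demo
# pl.col() builds an expression; Polars evaluates it inside a context
demo = pl.DataFrame({"city": ["  juba ", " wau "]})
print(demo.with_columns(pl.col("city").str.strip_chars().str.to_titlecase()))
```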
### Step 9: Identify String Columns
Before cleaning text values, we need to know which columns contain text.
```{python}
#| label: identify-string-columns
# Find columns with Utf8 (string) data type
string_columns = [col for col, dtype in df.schema.items() if dtype == pl.Utf8]
print(f"📝 String columns to clean: {string_columns}")
```
### Step 10: Clean String Values
Text data often has inconsistencies: extra spaces, mixed capitalization. Let's fix that!
::: {.callout-note}
## Variable Progression in This Tutorial
As we clean our data, we use descriptive variable names to track the state:
- `df_raw` → Original data loaded from URL
- `df` → Column names cleaned (snake_case)
- `df_cleaned` → String values cleaned + data types fixed
- `census_df` → Gender column extracted from `variable_name` (using expressions)
- `census_clean` → Age categories standardized
- `census` → "Total" rows filtered out (analysis-ready)
Notice how we use **Polars expressions** to transform data efficiently without intermediate copies!
:::
```{python}
#| label: clean-string-values
# Clean all string columns using Polars expressions
# - str.strip_chars() removes leading/trailing whitespace
# - str.to_titlecase() capitalizes first letter of each word
df_cleaned = df.with_columns([
pl.col(col).str.strip_chars().str.to_titlecase()
for col in string_columns
])
print("✅ String values cleaned!")
df_cleaned.head()
```
::: {.callout-note}
## Polars List Comprehensions
Polars uses list comprehensions with `pl.col()` to apply operations to multiple columns:
```python
df.with_columns([
pl.col(col).transform()
for col in columns
])
```
This is efficient and optimized by Polars' query engine!
:::
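As an aside, the `polars.selectors` module we imported as `cs` can express the same idea without a comprehension. A minimal alternative sketch, assuming a recent Polars release where selectors work inside `with_columns` (0.20+ does):

```{python}
#| label: selector-alternative
# cs.string() selects every string column, so one expression cleans them all
df_cleaned_alt = df.with_columns(
    cs.string().str.strip_chars().str.to_titlecase()
)
print(df_cleaned_alt.head(2))
```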
### Step 11: Ensure Correct Data Types
Sometimes numeric columns get read as text. Let's make sure the `2008` column is numeric and store it under the clearer name `population`.
```{python}
#| label: fix-data-types
# Convert population column to integer type
df_cleaned = df_cleaned.with_columns([
pl.col("2008").cast(pl.Int64, strict=False).alias("population")
])
print(f"✅ Population column type: {df_cleaned['population'].dtype}")
print(f"❓ Missing values after conversion: {df_cleaned['population'].null_count()}")
```
### Step 12: String Splitting - Extracting Gender Information
Now let's explore the `variable_name` column and extract meaningful information from it. This column contains structured text like "Population, Male (Number)" that we can split and parse.
```{python}
#| label: explore-variable-name
# Examine what's in the variable_name column
print("🔍 Unique values in variable_name column:")
print(df_cleaned["variable_name"].unique())
```
::: {.callout-tip}
## String Splitting Strategy
The `variable_name` column follows a pattern: "Category, Gender (Type)", e.g. "Population, Male (Number)".
We want to extract the middle piece (Male, Female, Total) using Polars string methods!
:::
#### Method 1: Using `.str.split()` with Index Selection
The simplest approach: split on delimiter and select the piece you want.
```{python}
#| label: method-1-split
# Split on spaces and take the second token (index 1)
method_1_df = df_cleaned.with_columns([
pl.col("variable_name")
.str.split(" ")
.list.get(1) # Get second element (index 1)
.alias("gender_extracted")
])
print("✅ Method 1: Basic split with index")
print(method_1_df.select(["variable_name", "gender_extracted"]).unique())
```
::: {.callout-note}
## How `.str.split()` Works in Polars
- `str.split(' ')` breaks the string at each space, so "Population, Male (Number)" becomes ["Population,", "Male", "(Number)"]
- Returns a **List** data type in Polars
- `.list.get(1)` selects the second element (0-based indexing)
- Much faster than pandas due to vectorized Rust implementation!
:::
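You can convince yourself on a single sample value first (a toy check, assuming the "Population, Male (Number)" pattern shown above):

```{python}
#| label: split-demo
# Splitting one sample value on spaces and taking the second token
sample = pl.Series(["Population, Male (Number)"])
print(sample.str.split(" ").list.get(1))   # -> "Male"
```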
#### Method 2: Chain Multiple String Operations
Sometimes you need to clean up the extracted text further.
```{python}
#| label: method-2-chained-split
# Split twice: first on ',', then trim off the '(Number)' part
method_2_df = df_cleaned.with_columns([
    pl.col("variable_name")
    .str.split(",")
    .list.get(1)           # Get the " Male (Number)" piece
    .str.split(" (")
    .list.get(0)           # Remove the (Number) part
    .str.strip_chars()     # Trim the leading space left by the comma split
    .alias("gender_clean")
])
print("✅ Method 2: Chained splits for clean extraction")
print(method_2_df.select(["variable_name", "gender_clean"]).unique())
```
#### Method 3: Using `.str.extract()` with Regex
For more complex patterns, regular expressions provide powerful extraction.
```{python}
#| label: method-3-regex
# Use a regex pattern to extract the word between ', ' and ' ('
method_3_df = df_cleaned.with_columns([
pl.col("variable_name")
.str.extract(r", (\w+) \(", group_index=1)
.alias("gender_regex")
])
print("✅ Method 3: Regex extraction")
print(method_3_df.select(["variable_name", "gender_regex"]).unique())
```
::: {.callout-note}
## Understanding the Regex Pattern
`r" - (\w+) \("` breaks down as:
- `r` = raw string (treats backslashes literally)
- ` - ` = matches literal " - "
- `(\w+)` = captures one or more word characters (letters/digits)
- ` \(` = matches literal " (" (backslash escapes the parenthesis)
The parentheses `()` create a "capture group" that `.extract()` returns.
:::
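It is good practice to test a regex on a sample value before applying it at scale. A quick toy check, assuming the same value pattern:

```{python}
#| label: regex-demo
# The capture group grabs the word between ", " and " ("
sample = pl.Series(["Population, Female (Number)"])
print(sample.str.extract(r", (\w+) \(", group_index=1))   # -> "Female"
```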
#### Method 4: Using `.str.replace()` for Pattern Removal
Remove unwanted parts and extract what remains.
```{python}
#| label: method-4-replace
# Remove the prefix and suffix to isolate the gender
method_4_df = df_cleaned.with_columns([
pl.col("variable_name")
.str.replace("Population, ", "") # Remove prefix
.str.replace(r" \(Number\)", "") # Remove suffix
.alias("gender_replace")
])
print("✅ Method 4: String replacement")
print(method_4_df.select(["variable_name", "gender_replace"]).unique())
```
#### Method 5: One-liner with Polars Expressions
Combine everything into a compact, readable expression. This creates our final cleaned dataframe: `census_df`.
```{python}
#| label: method-5-final
# The most elegant solution: all-in-one chained extraction
# This is our final cleaned dataset with extracted gender information
census_df = (
df_cleaned.with_columns([
pl.col("variable_name")
.str.split(" ")
.list.get(1) # Take the second token (the gender)
.str.split(" (")
.list.get(0) # Defensive: drop a trailing "(Number)" piece if present
.str.strip_chars() # Remove any extra whitespace
.alias("gender")
])
.rename({"age_name": "age_category"})
)
print("✅ Method 5: Complete chained solution")
print(census_df.select(["variable_name", "gender"]).unique())
# Verify the extraction worked
print("\n🎯 Unique gender values extracted:")
print(census_df["gender"].unique())
```
::: {.callout-tip}
## Which Method Should You Use?
**Choose based on your needs:**
| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Method 1** | Simple, consistent patterns | Fast, readable | Requires fixed structure |
| **Method 2** | Multi-step cleaning | Very clear logic | More verbose |
| **Method 3** | Complex patterns | Most flexible | Requires regex knowledge |
| **Method 4** | Known prefix/suffix | Intuitive | Less robust to variations |
| **Method 5** | Production code | Compact, professional | Harder for beginners |
For this tutorial, **Method 5** is ideal: it's clean, readable, and leverages Polars' optimization!
:::
#### Bonus: Cleaning Age Categories
Now let's also clean the `age_category` column to standardize the age ranges.
```{python}
#| label: explore-age-categories
# First, let's see what age categories exist
print("🔍 Current age categories:")
print(census_df["age_category"].unique().sort())
```
We can use Polars' powerful expression system to standardize these age ranges. Let's create standardized age brackets: **0-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65+**
**Method 1: Using `pl.when().then().otherwise()` (Polars' case_when)**
```{python}
#| label: age-cleaning-method-1
# Method 1: Using Polars' when-then-otherwise (similar to SQL CASE WHEN)
census_clean = census_df.with_columns(
age_category=pl.when(pl.col("age_category").is_in(["0 To 4", "5 To 9", "10 To 14"]))
.then(pl.lit("0-14"))
.when(pl.col("age_category").is_in(["15 To 19", "20 To 24"]))
.then(pl.lit("15-24"))
.when(pl.col("age_category").is_in(["25 To 29", "30 To 34"]))
.then(pl.lit("25-34"))
.when(pl.col("age_category").is_in(["35 To 39", "40 To 44"]))
.then(pl.lit("35-44"))
.when(pl.col("age_category").is_in(["45 To 49", "50 To 54"]))
.then(pl.lit("45-54"))
.when(pl.col("age_category").is_in(["55 To 59", "60 To 64"]))
.then(pl.lit("55-64"))
.when(pl.col("age_category") == "65+")
.then(pl.lit("65+"))
.otherwise(pl.col("age_category"))
)
print("✅ Age categories standardized (Method 1: when-then-otherwise)!")
print("\n🎯 New age categories:")
print(census_df["age_category"].unique().sort())
```
**Method 2: Using `.str.replace_all()` with Multiple Patterns**
```{python}
#| label: age-cleaning-method-2-demo
# Method 2: Using replace_all for demonstration
# (We'll keep Method 1's result, but showing this for learning)
# Create a test dataframe to demonstrate Method 2
df_test = census_df.with_columns([
pl.col("age_category")
.str.replace_all("0 To 4", "0-14")
.str.replace_all("5 To 9", "0-14")
.str.replace_all("10 To 14", "0-14")
.str.replace_all("15 To 19", "15-24")
.str.replace_all("20 To 24", "15-24")
.str.replace_all("25 To 29", "25-34")
.str.replace_all("30 To 34", "25-34")
.str.replace_all("35 To 39", "35-44")
.str.replace_all("40 To 44", "35-44")
.str.replace_all("45 To 49", "45-54")
.str.replace_all("50 To 54", "45-54")
.str.replace_all("55 To 59", "55-64")
.str.replace_all("60 To 64", "55-64")
.alias("age_category_alt")
])
print("✅ Method 2 result (for demonstration):")
print(df_test["age_category_alt"].unique().sort())
```
::: {.callout-note}
## Comparing Polars Methods for Categorical Mapping
**When to use `when().then().otherwise()`:**
- Conditional logic with multiple criteria
- Most readable for complex conditions
- Similar to SQL CASE WHEN
- Polars optimizes these expressions automatically
**When to use `.str.replace_all()`:**
- Simple string find-and-replace
- Known patterns that don't overlap
- Quick and straightforward
**Performance Note:**
Both methods are extremely fast in Polars due to:
- Vectorized operations in Rust
- Query optimization engine
- Parallel execution when possible
For our use case, `when-then` is clearer and more maintainable!
:::
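For completeness, recent Polars releases (roughly 0.19.16 and later) offer a third option worth knowing: passing a dictionary to `Expr.replace()` for an exact-value mapping. A sketch (not executed here):

```{python}
#| label: replace-mapping-sketch
#| eval: false
# Exact-value mapping with a dict: no regex, no when-then chain
age_map = {
    "0 To 4": "0-14", "5 To 9": "0-14", "10 To 14": "0-14",
    "15 To 19": "15-24", "20 To 24": "15-24",
    "25 To 29": "25-34", "30 To 34": "25-34",
    "35 To 39": "35-44", "40 To 44": "35-44",
    "45 To 49": "45-54", "50 To 54": "45-54",
    "55 To 59": "55-64", "60 To 64": "55-64",
}
census_alt = census_df.with_columns(pl.col("age_category").replace(age_map))
```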
### Performance Comparison in Polars
Let's test which method is more efficient:
```{python}
#| label: performance-comparison
import time
# Create a larger test dataset for meaningful timing
test_df = census_clean
# Replicate the data 10,000 times
test_df = pl.concat([test_df] * 10_000)
print(f"Test dataset size: {test_df.shape[0]:,} rows\n")
# Method 1: when-then-otherwise
start_time = time.time()
result1 = test_df.with_columns([
pl.when(pl.col("age_category").is_in(["0 To 4", "5 To 9", "10 To 14"]))
.then(pl.lit("0-14"))
.when(pl.col("age_category").is_in(["15 To 19", "20 To 24"]))
.then(pl.lit("15-24"))
.when(pl.col("age_category").is_in(["25 To 29", "30 To 34"]))
.then(pl.lit("25-34"))
.when(pl.col("age_category").is_in(["35 To 39", "40 To 44"]))
.then(pl.lit("35-44"))
.when(pl.col("age_category").is_in(["45 To 49", "50 To 54"]))
.then(pl.lit("45-54"))
.when(pl.col("age_category").is_in(["55 To 59", "60 To 64"]))
.then(pl.lit("55-64"))
.when(pl.col("age_category") == "65+")
.then(pl.lit("65+"))
.otherwise(pl.col("age_category"))
])
time_when = time.time() - start_time
# Method 2: str.replace_all (chained)
start_time = time.time()
result2 = test_df.with_columns([
pl.col("age_category")
.str.replace_all("0 To 4", "0-14")
.str.replace_all("5 To 9", "0-14")
.str.replace_all("10 To 14", "0-14")
.str.replace_all("15 To 19", "15-24")
.str.replace_all("20 To 24", "15-24")
.str.replace_all("25 To 29", "25-34")
.str.replace_all("30 To 34", "25-34")
.str.replace_all("35 To 39", "35-44")
.str.replace_all("40 To 44", "35-44")
.str.replace_all("45 To 49", "45-54")
.str.replace_all("50 To 54", "45-54")
.str.replace_all("55 To 59", "55-64")
.str.replace_all("60 To 64", "55-64")
.alias("age_category_alt")
])
time_replace = time.time() - start_time
# Display results
print("⏱️ Performance Comparison:")
print(f" when-then: {time_when*1000:.2f} ms")
print(f" str.replace: {time_replace*1000:.2f} ms")
if time_when < time_replace:
print(f"\n🏆 Winner: when-then is {time_replace/time_when:.1f}x faster")
else:
print(f"\n🏆 Winner: str.replace is {time_when/time_replace:.1f}x faster")
```
::: {.callout-important}
## Why Polars is So Fast
**Polars advantages over pandas:**
1. **Rust backend**: Compiled code is much faster than Python
2. **Query optimization**: Like a SQL database, Polars optimizes before execution
3. **Parallelization**: Automatically uses all CPU cores
4. **Memory efficiency**: Arrow memory format reduces copying
5. **Lazy evaluation**: Combines operations for maximum efficiency
For our categorical mapping, both methods are optimized by Polars' query engine!
:::
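If you are curious, the lazy API lets you inspect the optimized plan yourself. A sketch (not executed here, and assuming a hypothetical local copy of the CSV named `ss_census.csv`):

```{python}
#| label: query-plan-sketch
#| eval: false
# explain() prints the optimized query plan, showing pushed-down
# predicates and projections before any data is read
plan = (
    pl.scan_csv("ss_census.csv")
    .filter(pl.col("2008").is_not_null())
    .select(["Region Name", "2008"])
)
print(plan.explain())
```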
### Step 13: Verify the Extraction
Let's confirm our extracted values match what we expect.
```{python}
#| label: verify-extraction
# Check the distribution of gender values
gender_counts = (
census_clean
.group_by("gender")
.agg([
pl.len().alias("count"),  # count rows per group (pl.len replaces the deprecated pl.count)
pl.col("population").sum().alias("total_population")
])
.sort("total_population", descending=True)
)
print("📊 Gender distribution in extracted data:")
print(gender_counts)
```
### Step 14: Prepare Data for Analysis
Let's create our final analysis-ready dataframe.
```{python}
#| label: prepare-final-data
# Select only the columns we need for analysis
census_clean_df = (
census_clean
.rename({"region_name": "state"})
.select(
"state",
"gender",
"age_category",
"population"
)
)
print("✅ Ready for analysis!")
print(f"📋 Final dataset: {census_clean_df.shape[0]:,} rows × {census_clean_df.shape[1]} columns")
print("\nColumn types:")
print(census_clean_df.schema)
```
```{python}
#| label: filter-totals
# Drop the aggregate "Total" rows so later group-bys don't double-count
census = (
census_clean_df
.filter(pl.col("gender") != "Total", pl.col("age_category") != "Total")
)
# Inspect output
print(census)
```
### Step 15: Final Cleaned Dataset Summary
```{python}
#| label: final-summary
# Summary of our cleaned data
print("=" * 50)
print("🎉 DATA CLEANING COMPLETE!")
print("=" * 50)
print(f"📋 Final shape: {census.shape[0]:,} rows × {census.shape[1]} columns")
print(f"📝 Columns: {census.columns}")
print("\n🔍 Preview:")
census.head()
```
---
## Part 5: Data Analysis
Now that our data is clean, let's calculate some insights!
### Step 16: Calculate Overview Metrics
```{python}
#| label: overview-metrics
# Build a summary of key statistics
overview_stats = pl.DataFrame({
"metric": [
"Total Population",
"Number of States",
"Age Categories",
"Gender Groups"
],
"value": [
census["population"].sum(),
census["state"].n_unique(),
census["age_category"].n_unique(),
census["gender"].n_unique()
]
})
print("📊 Overview Statistics:")
print(overview_stats)
```
### Step 17: National Gender Distribution
```{python}
#| label: gender-distribution
# Calculate gender distribution
gender_summary = (
census
.group_by("gender")
.agg([
pl.col("population").sum().alias("total_population")
])
.with_columns([
(pl.col("total_population") / pl.col("total_population").sum() * 100)
.round(2)
.alias("percentage")
])
.sort("total_population", descending=True)
)
# Display with Great Tables
(
GT(gender_summary.to_pandas())
.tab_header(
title="National Gender Distribution",
subtitle="South Sudan 2008 Census"
)
.fmt_number(
columns="total_population",
decimals=0,
use_seps=True
)
.fmt_number(
columns="percentage",
decimals=2
)
.cols_label(
gender="Gender",
total_population="Total Population",
percentage="Percentage (%)"
)
)
```
### Step 18: Age Distribution Analysis
```{python}
#| label: age-distribution
# Calculate age distribution
age_summary = (
census
.group_by("age_category")
.agg([
pl.col("population").sum().alias("total_population")
])
.with_columns([
(pl.col("total_population") / pl.col("total_population").sum() * 100)
.round(2)
.alias("percentage")
])
.sort("total_population", descending=True)
)
(
GT(age_summary.to_pandas())
.tab_header(
title="National Age Distribution",
subtitle="Standardized Age Categories"
)
.fmt_number(
columns="total_population",
decimals=0,
use_seps=True
)
.fmt_number(
columns="percentage",
decimals=2
)
.cols_label(
age_category="Age Category",
total_population="Total Population",
percentage="Percentage (%)"
)
)
```
### Step 19: Visualize Age Distribution
```{python}
#| label: viz-age-bar
#| fig-width: 12
#| fig-height: 6
# Create bar chart
fig_age = (
ggplot(age_summary, aes("age_category", "total_population", fill="total_population")) +
geom_col(alpha=0.85) +
scale_fill_gradient(low="#06b6d4", high="#22d3ee") +
labs(
title="National Age Distribution",
subtitle="Standardized Age Categories",
x="Age Category",
y="Population"
) +
geom_text(
aes(label="percentage"),
format_string="{:.1f}%",
va="bottom",
size=10
) +
scale_y_continuous(labels=lambda l: [f"{int(x/1000)}K" for x in l]) +
pystatr_theme +
theme(
figure_size=(12, 6),
axis_text_x=element_text(rotation=45, ha="right"),
legend_position="none"
)
)
fig_age
```
### Step 20: State-Level Analysis
```{python}
#| label: state-analysis
# Top 5 states by population
state_summary = (
census
.group_by("state")
.agg([
pl.col("population").sum().alias("total_population")
])
.sort("total_population", descending=True)
.head(5)
)
(
GT(state_summary.to_pandas())
.tab_header(
title="2008 Census by State",
subtitle="Ranked by Total Population"
)
.fmt_number(
columns="total_population",
decimals=0,
use_seps=True
)
.cols_label(
state="State",
total_population="Total Population"
)
)
```
### Step 21: Visualize State Populations
```{python}
#| label: viz-state-bars
#| fig-width: 12
#| fig-height: 8
# Create horizontal bar chart
fig_states = (
ggplot(state_summary, aes("reorder(state, total_population)", "total_population", fill="total_population")) +
geom_col(alpha=0.85) +
coord_flip() +
scale_fill_gradient(low="#1E3A8A", high="#1E3A8A") +
labs(
title="Top 10 Most Populous States",
subtitle="South Sudan 2008 Census",
x="State",
y="Population"
) +
scale_y_continuous(labels=lambda l: [f"{int(x/1000)}K" for x in l], limits=(0, 1_500_000)) +
pystatr_theme +
theme(
plot_title=element_text(color="#1E3A8A"),
figure_size=(12, 8),
legend_position="none"
)
)
fig_states
```
---
## Key Insights
::: {.callout-important}
## What the Data Tells Us
### 1. Population Concentration
The top five states account for a significant portion of the national population. This geographic concentration should inform resource allocation decisions.
### 2. Youth Demographics
The age distribution reveals a young population—typical of developing nations. This "youth bulge" represents both opportunity (workforce potential) and challenge (education, employment needs).
### 3. Gender Balance
Most states show relatively balanced gender distributions, with some variation that may reflect migration patterns, conflict impacts, or data collection methodology.
### 4. Regional Disparities
Substantial population differences between states suggest the need for differentiated development strategies rather than uniform national approaches.
:::
---
## Conclusion
Congratulations! You've completed a full demographic analysis using Polars and modern Python tools.
**What You Learned:**
✅ Loading data with `pl.read_csv()`
✅ Exploring data structure with Polars methods
✅ Cleaning data with Polars expressions
✅ **String splitting and extraction** with multiple methods
✅ **Age category standardization** using conditional logic
✅ Aggregating with `.group_by()` and `.agg()`
✅ Creating beautiful visualizations with plotnine
✅ Building professional tables with Great Tables
**The techniques you've learned apply to ANY dataset**—from business analytics to scientific research.
::: {.callout-tip}
## Why Choose Polars?
**Speed**: 5-10x faster than pandas on large datasets
**Memory**: More efficient Arrow memory format
**Syntax**: Clean, expressive API
**Optimization**: Automatic query optimization like SQL databases
**Future-proof**: Built for modern data science workflows
For large datasets (>1GB), Polars can save hours of processing time!
:::
::: {.callout-tip}
## Why Choose plotnine?
**Familiar**: Uses ggplot2 grammar (beloved by R users)
**Declarative**: Build complex plots with readable, layered syntax
**Reproducible**: Same code produces same results across platforms
**Professional**: Publication-quality graphics out of the box
**Flexible**: Easy to customize every aspect of your plots
Perfect for data scientists who want beautiful, maintainable visualization code!
:::
::: {.callout-tip}
## Next Steps
1. Try the lazy API with `pl.scan_csv()` for even faster processing
2. Explore Polars' window functions for advanced analytics (see the sketch after this list)
3. Practice with your own datasets
4. Experiment with different plotnine themes and geometries
5. Compare performance with pandas on large files
:::
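To give item 2 a concrete shape, here is a minimal window-function sketch (not executed here) that would add each row's share of its state's population to the `census` DataFrame from this tutorial:

```{python}
#| label: window-function-sketch
#| eval: false
# .over("state") computes the sum within each state without collapsing rows
census_with_share = census.with_columns(
    (pl.col("population") / pl.col("population").sum().over("state") * 100)
    .round(2)
    .alias("pct_of_state")
)
```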
---
## Technical Reference
### Libraries Used
| Library | Version | Purpose |
|---------|---------|---------|
| **polars** | 0.20+ | Fast data manipulation with Rust backend |
| **plotnine** | 0.12+ | Grammar of graphics visualization (ggplot2 for Python) |
| **Great Tables** | 0.8+ | Professional table formatting |
### Key Techniques Demonstrated
| Technique | Description |
|-----------|-------------|
| **Polars Expressions** | Optimized, chainable data transformations |
| **String Splitting** | Extract structured info from text with `.str` methods |
| **Conditional Logic** | `when().then().otherwise()` for complex mappings |
| **List Operations** | `.list.get()` for working with split strings |
| **Aggregations** | `.group_by().agg()` for summary statistics |
| **Grammar of Graphics** | Layered visualization approach with plotnine |
---
*Tutorial by Alierwai Reng, Founder of PyStatR+. For data science training or collaboration, reach out at info@pystatrplus.org*
---
::: {.callout-tip icon="false"}
## About the Author
Alierwai Reng is the Founder and Lead Educator of PyStatR+, a data science educator, and analytics leader with expertise in statistics and healthcare analytics. His mission is to make technical knowledge accessible through clear, beginner-friendly education. He believes in "Education from the Heart."
**For training, consulting, or collaboration opportunities:**
📧 info@pystatrplus.org
🌐 [pystatrplus.org](https://pystatrplus.org)
:::
::: {.callout-note icon="false"}
## Editor's Note
This tutorial reflects PyStatR+'s core philosophy: that data science education should be accessible, practical, and empowering. We believe the best learning happens when complexity is distilled into clarity—without sacrificing rigor.
At PyStatR+, we teach from the heart by putting ourselves in your shoes—because learning is a partnership, not a solitary journey.
**PyStatR+**: *Learning Simplified. Communication Amplified.*
:::