Interactive Demographic Explorer: South Sudan 2008 Census

A Beginner’s Guide to Data Analysis with Polars and plotnine

Python
Polars
Interactive Visualization
Demographics
Census
Learn data analysis step-by-step! This beginner-friendly tutorial explores South Sudan’s demographics using Polars for fast data processing and plotnine for publication-quality visualizations.
Author

Alierwai Reng

Published

November 20, 2024

Introduction

Welcome to this hands-on data analysis tutorial! By the end of this guide, you’ll understand how to:

  • Load and explore real-world census data
  • Clean data using Polars expressions (fast and efficient)
  • Split and transform strings to extract meaningful information
  • Calculate summary statistics and group-level metrics
  • Create beautiful visualizations with plotnine (ggplot2 for Python)

We’ll analyze South Sudan’s 2008 census data, but the techniques you learn apply to any dataset.

Tip: What is Polars?

Polars is a blazingly fast DataFrame library built in Rust with Python bindings. It’s designed for speed and memory efficiency, often outperforming pandas by 5-10x on large datasets. Polars uses a query optimization engine and lazy evaluation to make your data processing incredibly fast!

Tip: What is plotnine?

Plotnine is a Python implementation of R’s ggplot2, using the Grammar of Graphics. It provides a declarative, layered approach to creating visualizations, making it easy to build complex plots with readable code. Perfect for data scientists familiar with R or those who want reproducible, publication-quality graphics!


Part 1: Environment Setup

Step 1: Import Required Libraries

Every Python analysis starts with importing the tools we need. Think of libraries as specialized toolboxes—each designed for specific tasks.

# Data manipulation
import polars as pl          # The blazingly fast DataFrame library
import polars.selectors as cs  # Column selectors for easy column operations

# Visualization
from plotnine import *       # Grammar of graphics (ggplot2 for Python)
import plotnine as p9

# Tables
from great_tables import GT   # Professional table formatting
import great_tables as gt


print("✅ All libraries loaded successfully!")
print(f"1) 📊 Polars version: {pl.__version__};\n2) plotnine version: {p9.__version__};\n3) Great Tables version: {gt.__version__}")
✅ All libraries loaded successfully!
1) 📊 Polars version: 1.36.1;
2) plotnine version: 0.15.2;
3) Great Tables version: 0.20.0
Note: Library Purposes
Library What It Does
polars Fast, memory-efficient data manipulation with Rust backend
polars.selectors Easy column selection (similar to tidyselect in R)
plotnine Grammar of graphics visualization (ggplot2 port for Python)
great_tables Makes publication-quality tables

Step 2: Configure Visualization Theme

Consistent styling makes your visualizations look professional. We’ll create a PyStatR+ branded theme for plotnine.

# Define custom theme for plotnine
pystatr_theme = (
    theme_minimal() +
    theme(
        text=element_text(family="sans-serif", size=12),
        plot_title=element_text(size=16, weight="bold", color="#06b6d4"),
        plot_subtitle=element_text(size=12, color="#666666"),
        axis_text=element_text(size=10),
        legend_position="top",
        panel_grid_major=element_line(color="#e5e5e5"),
        panel_grid_minor=element_blank(),
        plot_background=element_rect(fill="white"),
        panel_background=element_rect(fill="white")
    )
)

# Define custom color palette
pystatr_colors = ["#22d3ee", "#FFD700", "#06b6d4", "#0891b2", "#0e7490"]

print("🎨 Custom theme configured!")
🎨 Custom theme configured!

Part 2: Loading the Data

Step 3: Load Census Data from URL

Real datasets often live on the web. Polars can read directly from URLs—no download required!

# Define the data source URL
data_url = (
    "https://raw.githubusercontent.com/tongakuot/r_tutorials/refs/heads/main/00-input/ss_2008_census_data_raw.csv"
)

# Load the data into a DataFrame
df_raw = pl.read_csv(
    data_url,
    schema_overrides={"2008": pl.Utf8},   # force string
    null_values=["", "NA", "N/A", "null"]
)

# Confirm successful load with basic info
print("✅ Data loaded successfully!")
print(f"📋 Rows: {df_raw.shape[0]:,}")
print(f"📋 Columns: {df_raw.shape[1]}")
✅ Data loaded successfully!
📋 Rows: 453
📋 Columns: 10
Tip: Eager vs Lazy Loading

Polars also offers lazy evaluation via scan_csv(), which builds an optimized query plan and defers execution until you call .collect(). For this tutorial, we use read_csv(), which loads data immediately (eager execution). Both approaches work, but lazy evaluation is more efficient for large datasets!
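Here is a minimal sketch of what a lazy version of our load could look like (reusing the data_url defined above; the raw column names referenced below appear in Step 4):

# Build a lazy query plan; nothing is read yet
lazy_df = (
    pl.scan_csv(
        data_url,
        schema_overrides={"2008": pl.Utf8},
        null_values=["", "NA", "N/A", "null"]
    )
    .filter(pl.col("Region Name") == "Upper Nile")
    .select(["Region Name", "Age Name", "2008"])
)

# .collect() executes the optimized plan and returns a DataFrame
upper_nile = lazy_df.collect()
print(upper_nile.shape)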


Part 3: Data Exploration

Before cleaning, we must understand our data. What columns exist? What types are they? Are there issues?

Step 4: Examine Column Names

# View the original column names and types
print(f"📝 Column names and types: {df_raw.schema}")
📝 Column names and types: Schema({'Region': String, 'Region Name': String, 'Region - RegionId': String, 'Variable': String, 'Variable Name': String, 'Age': String, 'Age Name': String, 'Scale': String, 'Units': String, '2008': String})

Notice any issues? Inconsistent capitalization, spaces, or special characters make columns harder to work with.

Step 5: Check Data Types

# Polars data types
print("🔍 Data types:")
for col, dtype in df_raw.schema.items():
    print(f"  {col}: {dtype}")
🔍 Data types:
  Region: String
  Region Name: String
  Region - RegionId: String
  Variable: String
  Variable Name: String
  Age: String
  Age Name: String
  Scale: String
  Units: String
  2008: String
Note: Understanding Polars Data Types
Type Meaning Examples
Int64 64-bit integers 1, 42, 1000
Float64 64-bit floats 3.14, 99.9
String (alias Utf8) Text “Male”, “Juba”
Date Calendar dates 2024-01-01

Step 6: Preview the Data

# View the first 5 rows
print(df_raw.head())
shape: (5, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬─────────┬────────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units   ┆ 2008   │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---    │
│ str    ┆ str         ┆ str               ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ str    │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═════════╪════════╡
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ Total    ┆ units ┆ Persons ┆ 964353 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 0 to 4   ┆ units ┆ Persons ┆ 150872 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 5 to 9   ┆ units ┆ Persons ┆ 151467 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 10 to 14 ┆ units ┆ Persons ┆ 126140 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 15 to 19 ┆ units ┆ Persons ┆ 103804 │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴─────────┴────────┘

Step 7: Check for Missing Values

# Count missing values in each column
missing_counts = df_raw.null_count()

print("❓ Missing values per column:")
print(missing_counts)
❓ Missing values per column:
shape: (1, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬───────┬──────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units ┆ 2008 │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---   ┆ ---  │
│ u32    ┆ u32         ┆ u32               ┆ u32      ┆   ┆ u32      ┆ u32   ┆ u32   ┆ u32  │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═══════╪══════╡
│ 1      ┆ 1           ┆ 3                 ┆ 3        ┆ … ┆ 3        ┆ 3     ┆ 3     ┆ 3    │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴───────┴──────┘
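To see exactly which rows carry those nulls, a quick sketch using pl.any_horizontal() (which ORs a condition across all columns) works:

# Show the rows that contain at least one null value
rows_with_nulls = df_raw.filter(pl.any_horizontal(pl.all().is_null()))
print(rows_with_nulls)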

Part 4: Data Cleaning with Polars Expressions

Now we transform our raw data into analysis-ready format using Polars expressions.

Step 8: Clean Column Names

Polars offers multiple ways to rename columns. Let’s explore different approaches for different use cases.

Method 1: Dictionary Mapping (Rename Specific Columns)

# Example: Rename specific columns using a dictionary
# This is useful when you want to rename only certain columns

method_1_df = df_raw.rename({
    "Region Name": "state",
    "Variable Name": "gender",
    "Age Name": "age_category",
    "2008": "population"
})

print("✅ Method 1: Dictionary mapping for specific columns")
print("📝 New column names:")
print(method_1_df.columns)
✅ Method 1: Dictionary mapping for specific columns
📝 New column names:
['Region', 'state', 'Region - RegionId', 'Variable', 'gender', 'Age', 'age_category', 'Scale', 'Units', 'population']

Method 2: Dictionary Comprehension (All Columns)

# Clean all column names: lowercase and replace spaces with underscores
# This applies the same transformation to ALL columns
df = df_raw.rename(
    {col: col.lower().replace(" ", "_") for col in df_raw.columns}
)

print("✅ Method 2: Transform all column names with comprehension")
print("📝 New column names:")
print(df.columns)
✅ Method 2: Transform all column names with comprehension
📝 New column names:
['region', 'region_name', 'region_-_regionid', 'variable', 'variable_name', 'age', 'age_name', 'scale', 'units', '2008']

Method 3: Using Lambda Function

# Alternative: Use a lambda function for complex transformations
method_3_df = df_raw.rename(
    lambda col: col.strip().lower().replace(" ", "_").replace("-", "")
)

print("✅ Method 3: Lambda function for complex cleaning")
print("📝 New column names:")
print(method_3_df.columns)
✅ Method 3: Lambda function for complex cleaning
📝 New column names:
['region', 'region_name', 'region__regionid', 'variable', 'variable_name', 'age', 'age_name', 'scale', 'units', '2008']
Note: Polars Column Renaming Options

Method 1: Dictionary with specific columns
  • Best for: Renaming a few specific columns
  • Syntax: df.rename({"old_name": "new_name"})
  • Leaves other columns unchanged

Method 2: Dictionary comprehension
  • Best for: Applying the same rule to all columns
  • Syntax: df.rename({col: transform(col) for col in df.columns})
  • Most flexible and commonly used

Method 3: Lambda function
  • Best for: Complex transformations
  • Syntax: df.rename(lambda col: transform(col))
  • Clean and concise

Comparison with Pandas:

# Pandas
df.rename(columns={"old": "new"})  # Note: columns= parameter

# Polars  
df.rename({"old": "new"})  # No columns= parameter needed
Tip: Common Column Cleaning Patterns

Snake case (recommended):

col.lower().replace(" ", "_")
# "My Column" → "my_column"

Remove special characters:

import re
re.sub(r'[^a-z0-9]+', '_', col.lower()).strip('_')
# "My-Column!" → "my_column"

Camel case:

''.join(word.capitalize() for word in col.split())
# "my column" → "MyColumn"
Tip: Polars Expression Pattern

Polars uses a different syntax than pandas:
  • Pandas: df['column'].method()
  • Polars: pl.col('column').method()

This allows Polars to optimize queries before execution!

Step 9: Identify String Columns

Before cleaning text values, we need to know which columns contain text.

# Find columns with Utf8 (string) data type
string_columns = [col for col, dtype in df.schema.items() if dtype == pl.Utf8]

print(f"📝 String columns to clean: {string_columns}")
📝 String columns to clean: ['region', 'region_name', 'region_-_regionid', 'variable', 'variable_name', 'age', 'age_name', 'scale', 'units', '2008']
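We imported polars.selectors as cs back in Step 1 but haven’t used it yet. As an alternative sketch, a selector returns the same list without a comprehension:

# Same result using column selectors (imported as cs in Step 1)
string_columns_cs = df.select(cs.string()).columns
print(f"📝 Via selectors: {string_columns_cs}")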

Step 10: Clean String Values

Text data often has inconsistencies: extra spaces, mixed capitalization. Let’s fix that!

Note: Variable Progression in This Tutorial

As we clean our data, we use descriptive variable names to track the state:
  • df_raw → Original data loaded from URL
  • df → Column names cleaned (snake_case)
  • df_cleaned → String values cleaned + data types fixed
  • census_df → Final dataset with extracted gender column (using expressions)

Notice how we use Polars expressions to transform data efficiently without intermediate copies!

# Clean all string columns using Polars expressions
# - str.strip_chars() removes leading/trailing whitespace
# - str.to_titlecase() capitalizes first letter of each word

df_cleaned = df.with_columns([
    pl.col(col).str.strip_chars().str.to_titlecase()
    for col in string_columns
])

print("✅ String values cleaned!")
df_cleaned.head()
✅ String values cleaned!
shape: (5, 10)
region region_name region_-_regionid variable variable_name age age_name scale units 2008
str str str str str str str str str str
"Kn.A2" "Upper Nile" "Ss-Nu" "Kn.B2" "Population, Total (Number)" "Kn.C1" "Total" "Units" "Persons" "964353"
"Kn.A2" "Upper Nile" "Ss-Nu" "Kn.B2" "Population, Total (Number)" "Kn.C2" "0 To 4" "Units" "Persons" "150872"
"Kn.A2" "Upper Nile" "Ss-Nu" "Kn.B2" "Population, Total (Number)" "Kn.C3" "5 To 9" "Units" "Persons" "151467"
"Kn.A2" "Upper Nile" "Ss-Nu" "Kn.B2" "Population, Total (Number)" "Kn.C4" "10 To 14" "Units" "Persons" "126140"
"Kn.A2" "Upper Nile" "Ss-Nu" "Kn.B2" "Population, Total (Number)" "Kn.C5" "15 To 19" "Units" "Persons" "103804"
Note: Polars List Comprehensions

You can build a list of expressions with a Python list comprehension and pass it to with_columns(), applying the same operation to multiple columns:

df.with_columns([
    pl.col(col).transform() 
    for col in columns
])

This is efficient and optimized by Polars’ query engine!

Step 11: Ensure Correct Data Types

Sometimes numeric columns get read as text. Let’s ensure the 2008 column is numeric.

# Convert population column to integer type
df_cleaned = df_cleaned.with_columns([
    pl.col("2008").cast(pl.Int64, strict=False).alias("population")
])

print(f"✅ Population column type: {df_cleaned['population'].dtype}")
print(f"❓ Missing values after conversion: {df_cleaned['population'].null_count()}")
✅ Population column type: Int64
❓ Missing values after conversion: 3
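Because strict=False silently turns unparseable values into nulls, it’s worth checking what (if anything) was lost. A small sketch:

# Rows where the cast produced a null but the raw "2008" value was not null
failed_casts = df_cleaned.filter(
    pl.col("population").is_null() & pl.col("2008").is_not_null()
)
print(f"Rows that failed numeric conversion: {failed_casts.shape[0]}")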

Step 12: String Splitting - Extracting Gender Information

Now let’s explore the variable_name column and extract meaningful information from it. This column contains structured text like “Population, Male (Number)” that we can split and parse.

# Examine what's in the variable_name column
print("🔍 Unique values in variable_name column:")
print(df_cleaned["variable_name"].unique())
🔍 Unique values in variable_name column:
shape: (4,)
Series: 'variable_name' [str]
[
    "Population, Male (Number)"
    null
    "Population, Total (Number)"
    "Population, Female (Number)"
]
Tip: String Splitting Strategy

The variable_name column follows the pattern “Category, Gender (Type)”.

We want to extract the middle piece (Male, Female, Total) using Polars string methods!

Method 1: Using .str.split() with Index Selection

The simplest approach: split on delimiter and select the piece you want.

# Split on ' ' (space) and extract the second part (index 1)
method_1_df = df_cleaned.with_columns([
    pl.col("variable_name")
    .str.split(" ")
    .list.get(1)  # Get second element (index 1)
    .alias("gender_extracted")
])

print("✅ Method 1: Basic split with index")
print(method_1_df.select(["variable_name", "gender_extracted"]).unique())
✅ Method 1: Basic split with index
shape: (4, 2)
┌─────────────────────────────┬──────────────────┐
│ variable_name               ┆ gender_extracted │
│ ---                         ┆ ---              │
│ str                         ┆ str              │
╞═════════════════════════════╪══════════════════╡
│ Population, Male (Number)   ┆ Male             │
│ Population, Total (Number)  ┆ Total            │
│ null                        ┆ null             │
│ Population, Female (Number) ┆ Female           │
└─────────────────────────────┴──────────────────┘
Note: How .str.split() Works in Polars
  • str.split(" ") breaks the string at each space
  • Returns a List data type in Polars
  • .list.get(1) selects the second element (0-based indexing)
  • Much faster than pandas due to vectorized Rust implementation!
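To make the List dtype concrete, here is a tiny standalone sketch (the value is illustrative):

# Demonstrate split -> List -> get on a toy Series
toy = pl.Series("s", ["Population, Male (Number)"])
print(toy.str.split(" "))              # List[str]: ["Population,", "Male", "(Number)"]
print(toy.str.split(" ").list.get(1))  # -> "Male"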

Method 2: Chain Multiple String Operations

Sometimes you need to clean up the extracted text further.

# Split twice: first on ',', then remove the parenthetical part.
# Note: a leading space survives (visible below); Method 5 strips it with strip_chars().
method_2_df = df_cleaned.with_columns([
    pl.col("variable_name")
    .str.split(",")
    .list.get(1)  # Get the " Gender (Number)" section
    .str.split(" (")
    .list.get(0)  # Drop the "(Number)" part
    .alias("gender_clean")
])

print("✅ Method 2: Chained splits for clean extraction")
print(method_2_df.select(["variable_name", "gender_clean"]).unique())
✅ Method 2: Chained splits for clean extraction
shape: (4, 2)
┌─────────────────────────────┬──────────────┐
│ variable_name               ┆ gender_clean │
│ ---                         ┆ ---          │
│ str                         ┆ str          │
╞═════════════════════════════╪══════════════╡
│ null                        ┆ null         │
│ Population, Female (Number) ┆  Female      │
│ Population, Total (Number)  ┆  Total       │
│ Population, Male (Number)   ┆  Male        │
└─────────────────────────────┴──────────────┘

Method 3: Using .str.extract() with Regex

For more complex patterns, regular expressions provide powerful extraction.

# Use a regex pattern to extract the text between ', ' and ' ('
method_3_df = df_cleaned.with_columns([
    pl.col("variable_name")
    .str.extract(r", (\w+) \(", group_index=1)
    .alias("gender_regex")
])

print("✅ Method 3: Regex extraction")
print(method_3_df.select(["variable_name", "gender_regex"]).unique())
✅ Method 3: Regex extraction
shape: (4, 2)
┌─────────────────────────────┬──────────────┐
│ variable_name               ┆ gender_regex │
│ ---                         ┆ ---          │
│ str                         ┆ str          │
╞═════════════════════════════╪══════════════╡
│ Population, Female (Number) ┆ Female       │
│ null                        ┆ null         │
│ Population, Total (Number)  ┆ Total        │
│ Population, Male (Number)   ┆ Male         │
└─────────────────────────────┴──────────────┘
Note: Understanding the Regex Pattern

r", (\w+) \(" breaks down as:
  • r = raw string (treats backslashes literally)
  • ", " = matches the literal comma and space after “Population”
  • (\w+) = captures one or more word characters (letters/digits)
  • \( = matches a literal “(” (the backslash escapes the parenthesis)

The parentheses () create a “capture group” that .extract() returns.
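As a quick sanity check, the same pattern works on a standalone Series (illustrative value):

# Toy demonstration of the capture group
toy = pl.Series("v", ["Population, Female (Number)"])
print(toy.str.extract(r", (\w+) \(", group_index=1))  # -> "Female"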

Method 4: Using .str.replace() for Pattern Removal

Remove unwanted parts and extract what remains.

# Remove the prefix and suffix to isolate the gender
method_4_df = df_cleaned.with_columns([
    pl.col("variable_name")
    .str.replace("Population, ", "")  # Remove prefix
    .str.replace(r" \(Number\)", "")   # Remove suffix
    .alias("gender_replace")
])

print("✅ Method 4: String replacement")
print(method_4_df.select(["variable_name", "gender_replace"]).unique())
✅ Method 4: String replacement
shape: (4, 2)
┌─────────────────────────────┬────────────────┐
│ variable_name               ┆ gender_replace │
│ ---                         ┆ ---            │
│ str                         ┆ str            │
╞═════════════════════════════╪════════════════╡
│ Population, Total (Number)  ┆ Total          │
│ Population, Male (Number)   ┆ Male           │
│ null                        ┆ null           │
│ Population, Female (Number) ┆ Female         │
└─────────────────────────────┴────────────────┘

Method 5: One-liner with Polars Expressions

Combine everything into a compact, readable expression. This creates our final cleaned dataframe: census_df.

# The most elegant solution: all-in-one chained extraction
# This is our final cleaned dataset with extracted gender information
census_df = (
    df_cleaned
    .with_columns([
        pl.col("variable_name")
        .str.split(" ")
        .list.get(1)        # Extract the gender word
        .str.split(" (")
        .list.get(0)        # Defensive: drop any parenthetical remainder
        .str.strip_chars()  # Remove any extra whitespace
        .alias("gender")
    ])
    .rename({"age_name": "age_category"})
)

print("✅ Method 5: Complete chained solution")
print(census_df.select(["variable_name", "gender"]).unique())

# Verify the extraction worked
print("\n🎯 Unique gender values extracted:")
print(census_df["gender"].unique())
✅ Method 5: Complete chained solution
shape: (4, 2)
┌─────────────────────────────┬────────┐
│ variable_name               ┆ gender │
│ ---                         ┆ ---    │
│ str                         ┆ str    │
╞═════════════════════════════╪════════╡
│ Population, Male (Number)   ┆ Male   │
│ Population, Female (Number) ┆ Female │
│ Population, Total (Number)  ┆ Total  │
│ null                        ┆ null   │
└─────────────────────────────┴────────┘

🎯 Unique gender values extracted:
shape: (4,)
Series: 'gender' [str]
[
    "Male"
    "Female"
    "Total"
    null
]
Tip: Which Method Should You Use?

Choose based on your needs:

Method Best For Pros Cons
Method 1 Simple, consistent patterns Fast, readable Requires fixed structure
Method 2 Multi-step cleaning Very clear logic More verbose
Method 3 Complex patterns Most flexible Requires regex knowledge
Method 4 Known prefix/suffix Intuitive Less robust to variations
Method 5 Production code Compact, professional Harder for beginners

For this tutorial, Method 5 is ideal: it’s clean, readable, and leverages Polars’ optimization!

Bonus: Cleaning Age Categories

Now let’s also clean the age_category column to standardize the age ranges.

# First, let's see what age categories exist
print("🔍 Current age categories:")
print(census_df["age_category"].unique().sort())
🔍 Current age categories:
shape: (16,)
Series: 'age_category' [str]
[
    null
    "0 To 4"
    "10 To 14"
    "15 To 19"
    "20 To 24"
    …
    "50 To 54"
    "55 To 59"
    "60 To 64"
    "65+"
    "Total"
]

We can use Polars’ powerful expression system to standardize these age ranges. Let’s create standardized age brackets: 0-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65+

Method 1: Using pl.when().then().otherwise() (Polars’ case_when)

# Method 1: Using Polars' when-then-otherwise (similar to SQL CASE WHEN)
census_clean = census_df.with_columns(
    age_category=pl.when(pl.col("age_category").is_in(["0 To 4", "5 To 9", "10 To 14"]))
    .then(pl.lit("0-14"))
    .when(pl.col("age_category").is_in(["15 To 19", "20 To 24"]))
    .then(pl.lit("15-24"))
    .when(pl.col("age_category").is_in(["25 To 29", "30 To 34"]))
    .then(pl.lit("25-34"))
    .when(pl.col("age_category").is_in(["35 To 39", "40 To 44"]))
    .then(pl.lit("35-44"))
    .when(pl.col("age_category").is_in(["45 To 49", "50 To 54"]))
    .then(pl.lit("45-54"))
    .when(pl.col("age_category").is_in(["55 To 59", "60 To 64"]))
    .then(pl.lit("55-64"))
    .when(pl.col("age_category") == "65+")
    .then(pl.lit("65+"))
    .otherwise(pl.col("age_category"))
)

print("✅ Age categories standardized (Method 1: when-then-otherwise)!")
print("\n🎯 New age categories:")
print(census_clean["age_category"].unique().sort())
✅ Age categories standardized (Method 1: when-then-otherwise)!

🎯 New age categories:
shape: (9,)
Series: 'age_category' [str]
[
    null
    "0-14"
    "15-24"
    "25-34"
    "35-44"
    "45-54"
    "55-64"
    "65+"
    "Total"
]

Method 2: Using .str.replace_all() with Multiple Patterns

# Method 2: Using replace_all for demonstration
# (We'll keep Method 1's result, but showing this for learning)

# Create a test dataframe to demonstrate Method 2
df_test = census_df.with_columns([
    pl.col("age_category")
    .str.replace_all("0 To 4", "0-14")
    .str.replace_all("5 To 9", "0-14")
    .str.replace_all("10 To 14", "0-14")
    .str.replace_all("15 To 19", "15-24")
    .str.replace_all("20 To 24", "15-24")
    .str.replace_all("25 To 29", "25-34")
    .str.replace_all("30 To 34", "25-34")
    .str.replace_all("35 To 39", "35-44")
    .str.replace_all("40 To 44", "35-44")
    .str.replace_all("45 To 49", "45-54")
    .str.replace_all("50 To 54", "45-54")
    .str.replace_all("55 To 59", "55-64")
    .str.replace_all("60 To 64", "55-64")
    .alias("age_category_alt")
])

print("✅ Method 2 result (for demonstration):")
print(df_test["age_category_alt"].unique().sort())
✅ Method 2 result (for demonstration):
shape: (10,)
Series: 'age_category_alt' [str]
[
    null
    "0-14"
    "15-24"
    "25-34"
    "35-44"
    "40-144"
    "45-54"
    "55-64"
    "65+"
    "Total"
]
Note: Comparing Polars Methods for Categorical Mapping

When to use when().then().otherwise():
  • Conditional logic with multiple criteria
  • Most readable for complex conditions
  • Similar to SQL CASE WHEN
  • Polars optimizes these expressions automatically

When to use .str.replace_all():
  • Simple string find-and-replace
  • Known patterns that don’t overlap
  • Quick and straightforward

Caution: notice the “40-144” artifact in Method 2’s output above. replace_all matches substrings, so “0 To 4” fires inside “40 To 44” (“4” + “0 To 4” + “4” → “40-144”) before the intended pattern can match. This is exactly why overlapping patterns are dangerous with find-and-replace.

Performance Note: Both methods are extremely fast in Polars due to:
  • Vectorized operations in Rust
  • Query optimization engine
  • Parallel execution when possible

For our use case, when-then is clearer and more maintainable!
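A third option worth knowing is value-level mapping with .replace(), which matches whole cell values rather than substrings, so it cannot produce artifacts like “40-144”. Treat this as a sketch rather than code the tutorial ran:

# Map exact cell values with a dictionary; unmatched values pass through unchanged
age_mapping = {
    "0 To 4": "0-14", "5 To 9": "0-14", "10 To 14": "0-14",
    "15 To 19": "15-24", "20 To 24": "15-24",
    "25 To 29": "25-34", "30 To 34": "25-34",
    "35 To 39": "35-44", "40 To 44": "35-44",
    "45 To 49": "45-54", "50 To 54": "45-54",
    "55 To 59": "55-64", "60 To 64": "55-64",
}

mapped = census_df.with_columns(
    pl.col("age_category").replace(age_mapping).alias("age_category_mapped")
)
print(mapped["age_category_mapped"].unique().sort())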

Performance Comparison in Polars

Let’s test which method is more efficient:

import time

# Create a larger test dataset for meaningful timing
test_df = census_clean
# Replicate the data 10,000 times for meaningful timing
test_df = pl.concat([test_df] * 10_000)

print(f"Test dataset size: {test_df.shape[0]:,} rows\n")

# Method 1: when-then-otherwise
start_time = time.time()
result1 = test_df.with_columns([
    pl.when(pl.col("age_category").is_in(["0 To 4", "5 To 9", "10 To 14"]))
    .then(pl.lit("0-14"))
    .when(pl.col("age_category").is_in(["15 To 19", "20 To 24"]))
    .then(pl.lit("15-24"))
    .when(pl.col("age_category").is_in(["25 To 29", "30 To 34"]))
    .then(pl.lit("25-34"))
    .when(pl.col("age_category").is_in(["35 To 39", "40 To 44"]))
    .then(pl.lit("35-44"))
    .when(pl.col("age_category").is_in(["45 To 49", "50 To 54"]))
    .then(pl.lit("45-54"))
    .when(pl.col("age_category").is_in(["55 To 59", "60 To 64"]))
    .then(pl.lit("55-64"))
    .when(pl.col("age_category") == "65+")
    .then(pl.lit("65+"))
    .otherwise(pl.col("age_category"))
])
time_when = time.time() - start_time

# Method 2: str.replace_all (chained)
start_time = time.time()
result2 = test_df.with_columns([
    pl.col("age_category")
    .str.replace_all("0 To 4", "0-14")
    .str.replace_all("5 To 9", "0-14")
    .str.replace_all("10 To 14", "0-14")
    .str.replace_all("15 To 19", "15-24")
    .str.replace_all("20 To 24", "15-24")
    .str.replace_all("25 To 29", "25-34")
    .str.replace_all("30 To 34", "25-34")
    .str.replace_all("35 To 39", "35-44")
    .str.replace_all("40 To 44", "35-44")
    .str.replace_all("45 To 49", "45-54")
    .str.replace_all("50 To 54", "45-54")
    .str.replace_all("55 To 59", "55-64")
    .str.replace_all("60 To 64", "55-64")
    .alias("age_category_alt")
])
time_replace = time.time() - start_time

# Display results
print("⏱️ Performance Comparison:")
print(f"  when-then:       {time_when*1000:.2f} ms")
print(f"  str.replace:     {time_replace*1000:.2f} ms")

if time_when < time_replace:
    print(f"\n🏆 Winner: when-then is {time_replace/time_when:.1f}x faster")
else:
    print(f"\n🏆 Winner: str.replace is {time_when/time_replace:.1f}x faster")
Test dataset size: 4,530,000 rows

⏱️ Performance Comparison:
  when-then:       1247.89 ms
  str.replace:     1295.13 ms

🏆 Winner: when-then is 1.0x faster
Important: Why Polars is So Fast

Polars advantages over pandas:
  1. Rust backend: Compiled code is much faster than Python
  2. Query optimization: Like a SQL database, Polars optimizes before execution
  3. Parallelization: Automatically uses all CPU cores
  4. Memory efficiency: Arrow memory format reduces copying
  5. Lazy evaluation: Combines operations for maximum efficiency

For our categorical mapping, both methods are optimized by Polars’ query engine!
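You can inspect the optimized plan yourself. As a sketch, convert the eager frame to a LazyFrame and call .explain():

# Peek at the query plan Polars would execute
plan = (
    census_clean.lazy()
    .filter(pl.col("gender") != "Total")
    .group_by("gender")
    .agg(pl.col("population").sum())
    .explain()
)
print(plan)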

Step 13: Verify the Extraction

Let’s confirm our extracted values match what we expect.

# Check the distribution of gender values
gender_counts = (
    census_clean
    .group_by("gender")
    .agg([
        pl.len().alias("count"),  # pl.len() counts rows (pl.count() is deprecated)
        pl.col("population").sum().alias("total_population")
    ])
    .sort("total_population", descending=True)
)

print("📊 Gender distribution in extracted data:")
print(gender_counts)
📊 Gender distribution in extracted data:
shape: (4, 3)
┌────────┬───────┬──────────────────┐
│ gender ┆ count ┆ total_population │
│ ---    ┆ ---   ┆ ---              │
│ str    ┆ u32   ┆ i64              │
╞════════╪═══════╪══════════════════╡
│ Total  ┆ 150   ┆ 16520980         │
│ Male   ┆ 150   ┆ 8574600          │
│ Female ┆ 150   ┆ 7946380          │
│ null   ┆ 3     ┆ 0                │
└────────┴───────┴──────────────────┘

Step 14: Prepare Data for Analysis

Let’s create our final analysis-ready dataframe.

# Select only the columns we need for analysis
census_clean_df = (
    census_clean
    .rename({"region_name": "state"})
    .select(
        "state",
        "gender",
        "age_category",
        "population"
    )
)

print("✅ Ready for analysis!")
print(f"📋 Final dataset: {census_clean_df.shape[0]:,} rows × {census_clean_df.shape[1]} columns")
print("\nColumn types:")
print(census_clean_df.schema)
✅ Ready for analysis!
📋 Final dataset: 453 rows × 4 columns

Column types:
Schema({'state': String, 'gender': String, 'age_category': String, 'population': Int64})
# Filter out the aggregate "Total" rows so gender and age groups aren't double-counted
census = (
    census_clean_df
    .filter(pl.col("gender") != "Total", pl.col("age_category") != "Total")
)

# Inspect output
print(census)
shape: (280, 4)
┌───────────────────┬────────┬──────────────┬────────────┐
│ state             ┆ gender ┆ age_category ┆ population │
│ ---               ┆ ---    ┆ ---          ┆ ---        │
│ str               ┆ str    ┆ str          ┆ i64        │
╞═══════════════════╪════════╪══════════════╪════════════╡
│ Upper Nile        ┆ Male   ┆ 0-14         ┆ 82690      │
│ Upper Nile        ┆ Male   ┆ 0-14         ┆ 83744      │
│ Upper Nile        ┆ Male   ┆ 0-14         ┆ 71027      │
│ Upper Nile        ┆ Male   ┆ 15-24        ┆ 57387      │
│ Upper Nile        ┆ Male   ┆ 15-24        ┆ 42521      │
│ …                 ┆ …      ┆ …            ┆ …          │
│ Eastern Equatoria ┆ Female ┆ 45-54        ┆ 13727      │
│ Eastern Equatoria ┆ Female ┆ 45-54        ┆ 9482       │
│ Eastern Equatoria ┆ Female ┆ 55-64        ┆ 5740       │
│ Eastern Equatoria ┆ Female ┆ 55-64        ┆ 5274       │
│ Eastern Equatoria ┆ Female ┆ 65+          ┆ 8637       │
└───────────────────┴────────┴──────────────┴────────────┘

Step 15: Final Cleaned Dataset Summary

# Summary of our cleaned data
print("=" * 50)
print("🎉 DATA CLEANING COMPLETE!")
print("=" * 50)
print(f"📋 Final shape: {census.shape[0]:,} rows × {census.shape[1]} columns")
print(f"📝 Columns: {census.columns}")
print("\n🔍 Preview:")
census.head()
==================================================
🎉 DATA CLEANING COMPLETE!
==================================================
📋 Final shape: 280 rows × 4 columns
📝 Columns: ['state', 'gender', 'age_category', 'population']

🔍 Preview:
shape: (5, 4)
state gender age_category population
str str str i64
"Upper Nile" "Male" "0-14" 82690
"Upper Nile" "Male" "0-14" 83744
"Upper Nile" "Male" "0-14" 71027
"Upper Nile" "Male" "15-24" 57387
"Upper Nile" "Male" "15-24" 42521

Part 5: Data Analysis

Now that our data is clean, let’s calculate some insights!

Step 16: Calculate Overview Metrics

# Build a summary of key statistics
overview_stats = pl.DataFrame({
    "metric": [
        "Total Population",
        "Number of States",
        "Age Categories",
        "Gender Groups"
    ],
    "value": [
        census["population"].sum(),
        census["state"].n_unique(),
        census["age_category"].n_unique(),
        census["gender"].n_unique()
    ]
})

print("📊 Overview Statistics:")
print(overview_stats)
📊 Overview Statistics:
shape: (4, 2)
┌──────────────────┬─────────┐
│ metric           ┆ value   │
│ ---              ┆ ---     │
│ str              ┆ i64     │
╞══════════════════╪═════════╡
│ Total Population ┆ 8260490 │
│ Number of States ┆ 10      │
│ Age Categories   ┆ 7       │
│ Gender Groups    ┆ 2       │
└──────────────────┴─────────┘

Step 17: National Gender Distribution

# Calculate gender distribution
gender_summary = (
    census
    .group_by("gender")
    .agg([
        pl.col("population").sum().alias("total_population")
    ])
    .with_columns([
        (pl.col("total_population") / pl.col("total_population").sum() * 100)
        .round(2)
        .alias("percentage")
    ])
    .sort("total_population", descending=True)
)

# Display with Great Tables
(
    GT(gender_summary.to_pandas())
    .tab_header(
        title="National Gender Distribution",
        subtitle="South Sudan 2008 Census"
    )
    .fmt_number(
        columns="total_population",
        decimals=0,
        use_seps=True
    )
    .fmt_number(
        columns="percentage",
        decimals=2
    )
    .cols_label(
        gender="Gender",
        total_population="Total Population",
        percentage="Percentage (%)"
    )
)
National Gender Distribution
South Sudan 2008 Census
Gender Total Population Percentage (%)
Male 4,287,300 51.90
Female 3,973,190 48.10
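From these totals we can also derive the sex ratio (males per 100 females), a standard demographic indicator. A short sketch using the gender_summary frame built above:

# Sex ratio: males per 100 females
male_pop = gender_summary.filter(pl.col("gender") == "Male")["total_population"][0]
female_pop = gender_summary.filter(pl.col("gender") == "Female")["total_population"][0]
print(f"⚖️ Sex ratio: {male_pop / female_pop * 100:.1f} males per 100 females")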

Step 18: Age Distribution Analysis

# Calculate age distribution
age_summary = (
    census
    .group_by("age_category")
    .agg([
        pl.col("population").sum().alias("total_population")
    ])
    .with_columns([
        (pl.col("total_population") / pl.col("total_population").sum() * 100)
        .round(2)
        .alias("percentage")
    ])
    .sort("total_population", descending=True)
)

(
    GT(age_summary.to_pandas())
    .tab_header(
        title="National Age Distribution",
        subtitle="Standardized Age Categories"
    )
    .fmt_number(
        columns="total_population",
        decimals=0,
        use_seps=True
    )
    .fmt_number(
        columns="percentage",
        decimals=2
    )
    .cols_label(
        age_category="Age Category",
        total_population="Total Population",
        percentage="Percentage (%)"
    )
)
National Age Distribution
Standardized Age Categories
Age Category Total Population Percentage (%)
0-14 3,659,337 44.30
15-24 1,628,835 19.72
25-34 1,234,926 14.95
35-44 815,517 9.87
45-54 473,365 5.73
55-64 237,426 2.87
65+ 211,084 2.56
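The standardized brackets also make it easy to compute a rough dependency ratio (persons aged 0-14 and 65+ per 100 persons aged 15-64). A sketch using the census frame:

# Rough dependency ratio from the standardized age brackets
dependents = census.filter(pl.col("age_category").is_in(["0-14", "65+"]))["population"].sum()
working_age = census.filter(~pl.col("age_category").is_in(["0-14", "65+"]))["population"].sum()
print(f"👪 Dependency ratio: {dependents / working_age * 100:.1f} per 100 working-age persons")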

Step 19: Visualize Age Distribution

# Create bar chart
fig_age = (
    ggplot(age_summary, aes("age_category", "total_population", fill="total_population")) +
    geom_col(alpha=0.85) +
    scale_fill_gradient(low="#06b6d4", high="#22d3ee") +
    labs(
        title="National Age Distribution",
        subtitle="Standardized Age Categories",
        x="Age Category",
        y="Population"
    ) +
    geom_text(
        aes(label="percentage"),
        format_string="{:.1f}%",
        va="bottom",
        size=10
    ) +
    scale_y_continuous(labels=lambda l: [f"{int(x/1000)}K" for x in l]) +
    pystatr_theme +
    theme(
        figure_size=(12, 6),
        axis_text_x=element_text(rotation=45, ha="right"),
        legend_position="none"
    )
)

fig_age

Step 20: State-Level Analysis

# Top 5 states by population
state_summary = (
    census
    .group_by("state")
    .agg([
        pl.col("population").sum().alias("total_population")
    ])
    .sort("total_population", descending=True)
    .head(5)
)

(
    GT(state_summary.to_pandas())
    .tab_header(
        title="2008 Census by State",
        subtitle="Ranked by Total Population"
    )
    .fmt_number(
        columns="total_population",
        decimals=0,
        use_seps=True
    )
    .cols_label(
        state="State",
        total_population="Total Population"
    )
)
2008 Census by State
Ranked by Total Population
State Total Population
Jonglei 1,358,602
Central Equatoria 1,103,557
Warrap 972,928
Upper Nile 964,353
Eastern Equatoria 906,161

Step 21: Visualize State Populations

# Create horizontal bar chart
fig_states = (
    ggplot(state_summary, aes("reorder(state, total_population)", "total_population", fill="total_population")) +
    geom_col(alpha=0.85) +
    coord_flip() +
    scale_fill_gradient(low="#1E3A8A", high="#1E3A8A") +
    labs(
        title="Top 10 Most Populous States",
        subtitle="South Sudan 2008 Census",
        x="State",
        y="Population"
    ) +
    scale_y_continuous(labels=lambda l: [f"{int(x/1000)}K" for x in l], limits=(0, 1_500_000)) +
    pystatr_theme +
    theme(
        plot_title=element_text(color="#1E3A8A"),
        figure_size=(12, 8),
        legend_position="none"
    )
)

fig_states


Key Insights

Important: What the Data Tells Us

1. Population Concentration

The top five states account for a significant portion of the national population. This geographic concentration should inform resource allocation decisions.

2. Youth Demographics

The age distribution reveals a young population—typical of developing nations. This “youth bulge” represents both opportunity (workforce potential) and challenge (education, employment needs).

3. Gender Balance

Most states show relatively balanced gender distributions, with some variation that may reflect migration patterns, conflict impacts, or data collection methodology.

4. Regional Disparities

Substantial population differences between states suggest the need for differentiated development strategies rather than uniform national approaches.


Conclusion

Congratulations! You’ve completed a full demographic analysis using Polars and modern Python tools.

What You Learned:

✅ Loading data with pl.read_csv()

✅ Exploring data structure with Polars methods

✅ Cleaning data with Polars expressions

✅ String splitting and extraction with multiple methods

✅ Age category standardization using conditional logic

✅ Aggregating with .group_by() and .agg()

✅ Creating beautiful visualizations with plotnine

✅ Building professional tables with Great Tables

The techniques you’ve learned apply to ANY dataset—from business analytics to scientific research.

Tip: Why Choose Polars?

Speed: 5-10x faster than pandas on large datasets
Memory: More efficient Arrow memory format
Syntax: Clean, expressive API
Optimization: Automatic query optimization like SQL databases
Future-proof: Built for modern data science workflows

For large datasets (>1GB), Polars can save hours of processing time!

Tip: Why Choose plotnine?

Familiar: Uses ggplot2 grammar (beloved by R users)
Declarative: Build complex plots with readable, layered syntax
Reproducible: Same code produces same results across platforms
Professional: Publication-quality graphics out of the box
Flexible: Easy to customize every aspect of your plots

Perfect for data scientists who want beautiful, maintainable visualization code!

Tip: Next Steps
  1. Try the lazy API with pl.scan_csv() for even faster processing
  2. Explore Polars’ window functions for advanced analytics (see the sketch after this list)
  3. Practice with your own datasets
  4. Experiment with different plotnine themes and geometries
  5. Compare performance with pandas on large files
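As a starting point for item 2, here is a hedged sketch of a window function: computing each row’s share of its state’s total population with .over():

# Window function: each age/gender group's share within its state
shares = census.with_columns(
    (pl.col("population") / pl.col("population").sum().over("state") * 100)
    .round(2)
    .alias("pct_of_state")
)
print(shares.head())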

Technical Reference

Libraries Used

Library Version Purpose
polars 1.36.1 Fast data manipulation with Rust backend
plotnine 0.15.2 Grammar of graphics visualization (ggplot2 for Python)
great_tables 0.20.0 Professional table formatting

Key Techniques Demonstrated

Technique Description
Polars Expressions Optimized, chainable data transformations
String Splitting Extract structured info from text with .str methods
Conditional Logic when().then().otherwise() for complex mappings
List Operations .list.get() for working with split strings
Aggregations .group_by().agg() for summary statistics
Grammar of Graphics Layered visualization approach with plotnine

Tutorial by Alierwai Reng, Founder of PyStatR+. For data science training or collaboration, reach out at info@pystatrplus.org


Tip: About the Author

Alierwai Reng is the Founder and Lead Educator of PyStatR+, a data science educator, and analytics leader with expertise in statistics and healthcare analytics. His mission is to make technical knowledge accessible through clear, beginner-friendly education. He believes in “Education from the Heart.”

For training, consulting, or collaboration opportunities: 📧 info@pystatrplus.org 🌐 pystatrplus.org

Note: Editor’s Note

This tutorial reflects PyStatR+’s core philosophy: that data science education should be accessible, practical, and empowering. We believe the best learning happens when complexity is distilled into clarity—without sacrificing rigor.

At PyStatR+, we teach from the heart by putting ourselves in your shoes—because learning is a partnership, not a solitary journey.

PyStatR+: Learning Simplified. Communication Amplified.