Data Wrangling & Transformation: South Sudan 2008 Census

A Beginner’s Guide to Data Wrangling with Polars

Python
Polars
Data Wrangling
Demographics
Census
Master essential data wrangling techniques using South Sudan’s census data with blazingly fast Polars DataFrames. A beginner-friendly tutorial covering select(), filter(), sort(), group_by(), string methods, and more.
Author

Alierwai Reng

Published

February 8, 2026


Introduction

Welcome to this hands-on data analysis tutorial! This guide showcases Polars—the blazingly fast DataFrame library—with essential techniques for real-world data analysis.

By the end of this guide, you’ll understand how to:

  • Master essential methods like select(), filter(), with_columns(), and group_by()
  • Load and explore real-world census data with schema control
  • Transform string data using Polars’ powerful .str accessor methods
  • Calculate summary statistics using group_by() and agg()

We will analyze South Sudan’s 2008 census data as a practical case study; the techniques and workflows you will learn, however, transfer to any dataset in any domain.

Data Source: National Bureau of Statistics, South Sudan, via the Open Data for Africa platform: Population by Age and Sex (2008)

Note: Tested With

polars 1.37.1, Python 3.14.0


Part 1: Environment Setup

Step 1: Import Required Libraries

Every Python analysis starts with importing the tools we need. Think of libraries as specialized toolboxes—each designed for specific tasks.

# System information
import sys

# Data manipulation
import polars as pl           # The blazingly fast DataFrame library
import polars.selectors as cs  # Column selectors for easy column operations

# Confirmation
print("✅ All libraries loaded successfully!")
print(f"📦 Polars version: {pl.__version__}")
print(f"🐍 Python version: {sys.version.split()[0]}")
✅ All libraries loaded successfully!
📦 Polars version: 1.25.2
🐍 Python version: 3.13.2
Note: Package Installation

If you don’t have Polars installed, run one of the following once in your terminal (the quotes keep the shell from treating `>` as a redirect):

uv add "polars>=1.37.1"
pip install "polars>=1.37.1"

Library Purposes:

  • polars: Fast, memory-efficient data manipulation with a Rust backend
  • polars.selectors: Easy column selection (similar to tidyselect in R)

Part 2: Essential Polars Methods

Let’s master the core Polars methods you’ll use in every analysis.

Step 2: select() vs with_columns()

select() returns only the columns you specify; with_columns() keeps every existing column and adds new columns or modifies existing ones.

demo_df = pl.DataFrame({
    "name": ["Alek", "Bol", "Chol"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 75000]
})

selected = demo_df.select("name", "age")
print(selected)

with_new = demo_df.with_columns(
    (pl.col("salary") * 1.1).alias("new_salary")
)
print(with_new)
shape: (3, 2)
┌──────┬─────┐
│ name ┆ age │
│ ---  ┆ --- │
│ str  ┆ i64 │
╞══════╪═════╡
│ Alek ┆ 25  │
│ Bol  ┆ 30  │
│ Chol ┆ 35  │
└──────┴─────┘
shape: (3, 4)
┌──────┬─────┬────────┬────────────┐
│ name ┆ age ┆ salary ┆ new_salary │
│ ---  ┆ --- ┆ ---    ┆ ---        │
│ str  ┆ i64 ┆ i64    ┆ f64        │
╞══════╪═════╪════════╪════════════╡
│ Alek ┆ 25  ┆ 50000  ┆ 55000.0    │
│ Bol  ┆ 30  ┆ 60000  ┆ 66000.0    │
│ Chol ┆ 35  ┆ 75000  ┆ 82500.0    │
└──────┴─────┴────────┴────────────┘

Step 3: filter() — Subsetting Rows

high_earners = demo_df.filter(pl.col("salary") > 55000)
print(high_earners)

filtered = demo_df.filter(
    (pl.col("age") >= 30) & (pl.col("salary") > 50000)
)
print(filtered)
shape: (2, 3)
┌──────┬─────┬────────┐
│ name ┆ age ┆ salary │
│ ---  ┆ --- ┆ ---    │
│ str  ┆ i64 ┆ i64    │
╞══════╪═════╪════════╡
│ Bol  ┆ 30  ┆ 60000  │
│ Chol ┆ 35  ┆ 75000  │
└──────┴─────┴────────┘
shape: (2, 3)
┌──────┬─────┬────────┐
│ name ┆ age ┆ salary │
│ ---  ┆ --- ┆ ---    │
│ str  ┆ i64 ┆ i64    │
╞══════╪═════╪════════╡
│ Bol  ┆ 30  ┆ 60000  │
│ Chol ┆ 35  ┆ 75000  │
└──────┴─────┴────────┘

Step 4: sort() — Ordering Data

sorted_asc = demo_df.sort("salary")
print(sorted_asc)

sorted_desc = demo_df.sort("salary", descending=True)
print(sorted_desc)
shape: (3, 3)
┌──────┬─────┬────────┐
│ name ┆ age ┆ salary │
│ ---  ┆ --- ┆ ---    │
│ str  ┆ i64 ┆ i64    │
╞══════╪═════╪════════╡
│ Alek ┆ 25  ┆ 50000  │
│ Bol  ┆ 30  ┆ 60000  │
│ Chol ┆ 35  ┆ 75000  │
└──────┴─────┴────────┘
shape: (3, 3)
┌──────┬─────┬────────┐
│ name ┆ age ┆ salary │
│ ---  ┆ --- ┆ ---    │
│ str  ┆ i64 ┆ i64    │
╞══════╪═════╪════════╡
│ Chol ┆ 35  ┆ 75000  │
│ Bol  ┆ 30  ┆ 60000  │
│ Alek ┆ 25  ┆ 50000  │
└──────┴─────┴────────┘

Step 5: Column Selectors (cs)

Use selectors to pick columns by type: cs.numeric(), cs.string(), cs.boolean(), cs.temporal(), or by pattern with cs.contains() and cs.starts_with().

mixed_df = pl.DataFrame({
    "name": ["Alek", "Bol"],
    "age": [25, 30],
    "salary": [50000.0, 60000.0],
    "is_manager": [True, False]
})

numeric_cols = mixed_df.select(cs.numeric())
string_cols = mixed_df.select(cs.string())
print(numeric_cols)
print(string_cols)
shape: (2, 2)
┌─────┬─────────┐
│ age ┆ salary  │
│ --- ┆ ---     │
│ i64 ┆ f64     │
╞═════╪═════════╡
│ 25  ┆ 50000.0 │
│ 30  ┆ 60000.0 │
└─────┴─────────┘
shape: (2, 1)
┌──────┐
│ name │
│ ---  │
│ str  │
╞══════╡
│ Alek │
│ Bol  │
└──────┘

Part 3: Loading the Census Data

Now let’s apply these concepts to real census data!

Step 6: Load Census Data from URL

# Define the data source URL
data_url = (
    "https://raw.githubusercontent.com/tongakuot/r_tutorials/refs/heads/main/00-input/ss_2008_census_data_raw.csv"
)

# Load the data into a DataFrame
# schema_overrides forces the "2008" column to be read as string (Utf8) instead of integer
# null_values specifies which values should be treated as missing data
df_raw = pl.read_csv(
    data_url,
    schema_overrides={"2008": pl.Utf8},
    null_values=["", "NA", "N/A", "null"]
)

# Confirm successful load
print("✅ Data loaded successfully!")
print(f"📋 Rows: {df_raw.shape[0]:,}")
print(f"📋 Columns: {df_raw.shape[1]}")
print(f"\n📝 Schema: {df_raw.schema}")
✅ Data loaded successfully!
📋 Rows: 453
📋 Columns: 10

📝 Schema: Schema({'Region': String, 'Region Name': String, 'Region - RegionId': String, 'Variable': String, 'Variable Name': String, 'Age': String, 'Age Name': String, 'Scale': String, 'Units': String, '2008': String})

Step 7: Explore Data Structure

# Preview the data
print("🔍 First 5 rows:")
print(df_raw.head())

# Check for missing values
print("\n❓ Missing values per column:")
print(df_raw.null_count())

# Unique values in key columns
print("\n🗺️ Unique states:")
print(df_raw.get_column("Region Name").unique().sort())
🔍 First 5 rows:
shape: (5, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬─────────┬────────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units   ┆ 2008   │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---    │
│ str    ┆ str         ┆ str               ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ str    │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═════════╪════════╡
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ Total    ┆ units ┆ Persons ┆ 964353 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 0 to 4   ┆ units ┆ Persons ┆ 150872 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 5 to 9   ┆ units ┆ Persons ┆ 151467 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 10 to 14 ┆ units ┆ Persons ┆ 126140 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 15 to 19 ┆ units ┆ Persons ┆ 103804 │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴─────────┴────────┘

❓ Missing values per column:
shape: (1, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬───────┬──────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units ┆ 2008 │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---   ┆ ---  │
│ u32    ┆ u32         ┆ u32               ┆ u32      ┆   ┆ u32      ┆ u32   ┆ u32   ┆ u32  │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═══════╪══════╡
│ 1      ┆ 1           ┆ 3                 ┆ 3        ┆ … ┆ 3        ┆ 3     ┆ 3     ┆ 3    │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴───────┴──────┘

🗺️ Unique states:
shape: (13,)
Series: 'Region Name' [str]
[
    null
    "Central Equatoria"
    "Eastern Equatoria"
    "Jonglei"
    "Lakes"
    …
    "Upper Nile"
    "Warrap"
    "Western Bahr el Ghazal"
    "Western Equatoria"
    "http://southsudan.opendatafora…
]

Part 4: Data Cleaning with Polars

Step 8: Rename Columns

Use a dictionary for specific columns or a lambda for systematic transformations:

df = df_raw.rename({
    "Region Name": "state",
    "Variable Name": "gender_raw",
    "Age Name": "age_category",
    "2008": "population"
})

# Or use lambda for all columns
df = df_raw.rename(
    lambda col: col.lower().replace(" ", "_")
)

print(f"Columns: {df.columns}")
Columns: ['region', 'region_name', 'region_-_regionid', 'variable', 'variable_name', 'age', 'age_name', 'scale', 'units', '2008']

Step 9: String Cleaning with .str Methods

string_columns = [col for col, dtype in df.schema.items() if dtype == pl.Utf8]

df_cleaned = df.with_columns([
    pl.col(col).str.strip_chars().str.to_titlecase()
    for col in string_columns
])

print(df_cleaned.head())
shape: (5, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬─────────┬────────┐
│ region ┆ region_name ┆ region_-_regionid ┆ variable ┆ … ┆ age_name ┆ scale ┆ units   ┆ 2008   │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---    │
│ str    ┆ str         ┆ str               ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ str    │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═════════╪════════╡
│ Kn.A2  ┆ Upper Nile  ┆ Ss-Nu             ┆ Kn.B2    ┆ … ┆ Total    ┆ Units ┆ Persons ┆ 964353 │
│ Kn.A2  ┆ Upper Nile  ┆ Ss-Nu             ┆ Kn.B2    ┆ … ┆ 0 To 4   ┆ Units ┆ Persons ┆ 150872 │
│ Kn.A2  ┆ Upper Nile  ┆ Ss-Nu             ┆ Kn.B2    ┆ … ┆ 5 To 9   ┆ Units ┆ Persons ┆ 151467 │
│ Kn.A2  ┆ Upper Nile  ┆ Ss-Nu             ┆ Kn.B2    ┆ … ┆ 10 To 14 ┆ Units ┆ Persons ┆ 126140 │
│ Kn.A2  ┆ Upper Nile  ┆ Ss-Nu             ┆ Kn.B2    ┆ … ┆ 15 To 19 ┆ Units ┆ Persons ┆ 103804 │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴─────────┴────────┘

Step 10: Extract Gender Information

The variable_name column contains structured text like “Population, Male (Number)”. Let’s extract the gender:

# First, see what's in the column
print("🔍 Unique values in variable_name:")
print(df_cleaned["variable_name"].unique())

# Extract gender using string methods
census_df = (
    df_cleaned
    .with_columns([
        pl.col("variable_name")
        .str.split(" ")
        .list.get(1)       # Get second word
        .str.strip_chars()
        .alias("gender")
    ])
    .rename({"age_name": "age_category", "region_name": "state"})
)

print("\n✅ Gender extracted!")
print(census_df.select(["variable_name", "gender"]).unique())
🔍 Unique values in variable_name:
shape: (4,)
Series: 'variable_name' [str]
[
    "Population, Total (Number)"
    "Population, Female (Number)"
    "Population, Male (Number)"
    null
]

✅ Gender extracted!
shape: (4, 2)
┌─────────────────────────────┬────────┐
│ variable_name               ┆ gender │
│ ---                         ┆ ---    │
│ str                         ┆ str    │
╞═════════════════════════════╪════════╡
│ null                        ┆ null   │
│ Population, Total (Number)  ┆ Total  │
│ Population, Male (Number)   ┆ Male   │
│ Population, Female (Number) ┆ Female │
└─────────────────────────────┴────────┘

Step 11: Clean Age Categories with when().then().otherwise()

# First, convert population to numeric
census_df = census_df.with_columns([
    pl.col("2008").cast(pl.Int64, strict=False).alias("population")
])

# Standardize age categories
census_clean = census_df.with_columns(
    age_category=pl.when(pl.col("age_category").is_in(["0 To 4", "5 To 9", "10 To 14"]))
    .then(pl.lit("0-14 (Children)"))
    .when(pl.col("age_category").is_in(["15 To 19", "20 To 24"]))
    .then(pl.lit("15-24 (Youth)"))
    .when(pl.col("age_category").is_in(["25 To 29", "30 To 34", "35 To 39", "40 To 44"]))
    .then(pl.lit("25-44 (Adults)"))
    .when(pl.col("age_category").is_in(["45 To 49", "50 To 54", "55 To 59", "60 To 64"]))
    .then(pl.lit("45-64 (Middle Age)"))
    .when(pl.col("age_category") == "65+")
    .then(pl.lit("65+ (Seniors)"))
    .otherwise(pl.col("age_category"))
)

print("✅ Age categories standardized!")
print(census_clean["age_category"].unique().sort())
✅ Age categories standardized!
shape: (7,)
Series: 'age_category' [str]
[
    null
    "0-14 (Children)"
    "15-24 (Youth)"
    "25-44 (Adults)"
    "45-64 (Middle Age)"
    "65+ (Seniors)"
    "Total"
]

Step 12: Prepare Final Dataset

# Select and filter final columns
census = (
    census_clean
    .select("state", "gender", "age_category", "population")
    .filter(
        (pl.col("gender") != "Total") &
        (pl.col("age_category") != "Total")
    )
)

print("✅ Final dataset ready!")
print(f"📋 Shape: {census.shape[0]:,} rows × {census.shape[1]} columns")
print(census.head())
✅ Final dataset ready!
📋 Shape: 280 rows × 4 columns
shape: (5, 4)
┌────────────┬────────┬─────────────────┬────────────┐
│ state      ┆ gender ┆ age_category    ┆ population │
│ ---        ┆ ---    ┆ ---             ┆ ---        │
│ str        ┆ str    ┆ str             ┆ i64        │
╞════════════╪════════╪═════════════════╪════════════╡
│ Upper Nile ┆ Male   ┆ 0-14 (Children) ┆ 82690      │
│ Upper Nile ┆ Male   ┆ 0-14 (Children) ┆ 83744      │
│ Upper Nile ┆ Male   ┆ 0-14 (Children) ┆ 71027      │
│ Upper Nile ┆ Male   ┆ 15-24 (Youth)   ┆ 57387      │
│ Upper Nile ┆ Male   ┆ 15-24 (Youth)   ┆ 42521      │
└────────────┴────────┴─────────────────┴────────────┘

Part 5: Analysis & Summary Statistics

Now let’s use our cleaned data to extract meaningful insights!

Step 13: Gender Distribution Analysis

# Calculate gender distribution
gender_summary = (
    census
    .group_by("gender")
    .agg(pl.col("population").sum().alias("total_population"))
    .with_columns(
        (pl.col("total_population") / pl.col("total_population").sum() * 100)
        .round(2)
        .alias("percentage")
    )
    .sort("total_population", descending=True)
)

print("👥 Gender Distribution:")
print(gender_summary)
👥 Gender Distribution:
shape: (2, 3)
┌────────┬──────────────────┬────────────┐
│ gender ┆ total_population ┆ percentage │
│ ---    ┆ ---              ┆ ---        │
│ str    ┆ i64              ┆ f64        │
╞════════╪══════════════════╪════════════╡
│ Male   ┆ 4287300          ┆ 51.9       │
│ Female ┆ 3973190          ┆ 48.1       │
└────────┴──────────────────┴────────────┘

Step 14: Age Distribution Analysis

# Calculate age distribution
age_summary = (
    census
    .group_by("age_category")
    .agg(pl.col("population").sum().alias("total_population"))
    .with_columns(
        (pl.col("total_population") / pl.col("total_population").sum() * 100)
        .round(2)
        .alias("percentage")
    )
    .sort("total_population", descending=True)
)

print("📊 Age Distribution:")
print(age_summary)
📊 Age Distribution:
shape: (5, 3)
┌────────────────────┬──────────────────┬────────────┐
│ age_category       ┆ total_population ┆ percentage │
│ ---                ┆ ---              ┆ ---        │
│ str                ┆ i64              ┆ f64        │
╞════════════════════╪══════════════════╪════════════╡
│ 0-14 (Children)    ┆ 3659337          ┆ 44.3       │
│ 25-44 (Adults)     ┆ 2050443          ┆ 24.82      │
│ 15-24 (Youth)      ┆ 1628835          ┆ 19.72      │
│ 45-64 (Middle Age) ┆ 710791           ┆ 8.6        │
│ 65+ (Seniors)      ┆ 211084           ┆ 2.56       │
└────────────────────┴──────────────────┴────────────┘

Key Insights

Important: What the Data Tells Us

1. Population Concentration

The top states account for a significant portion of the national population. This geographic concentration has direct implications for infrastructure investment and resource allocation.

2. Youth Demographics

The age distribution reveals a predominantly young population—typical of developing nations. The 0-14 and 15-24 age groups represent the largest segments, indicating a “youth bulge” with both opportunities and challenges.

3. Gender Balance

Most states show relatively balanced gender distributions, with minor variations that may reflect migration patterns or data collection methodology differences.

4. Small Senior Cohort

Seniors (65+) consistently represent the smallest age group across states, underscoring the youthful population structure.


Conclusion

Congratulations! You’ve mastered essential Polars methods for data wrangling and transformation!

Tip: What You’ve Learned

Essential Polars Methods:

  • ✅ select() vs with_columns(): when to use each
  • ✅ filter() for row subsetting with conditions
  • ✅ sort() for ordering data
  • ✅ Column selectors (cs) for type-based selection
  • ✅ group_by().agg() for aggregations

Data Cleaning & Transformation:

  • ✅ Renaming columns (dict, comprehension, lambda)
  • ✅ String methods (strip, titlecase, split, extract)
  • ✅ Conditional logic with when().then().otherwise()
  • ✅ Type casting with .cast()

Note: Next Steps for Learning

Beginner:

  1. Practice group_by() and agg() on your own data
  2. Experiment with different column selectors
  3. Try more complex filter conditions

Intermediate:

  1. Learn the lazy API with scan_csv() for larger datasets
  2. Explore window functions for advanced analytics
  3. Study the streaming API for huge files

Advanced:

  1. Build data pipelines with Polars’ lazy evaluation
  2. Compare performance benchmarks with pandas
  3. Explore Polars’ Rust API for maximum performance


Alier Reng

Founder, Lead Educator & Creative Director at PyStatR+

Alier Reng is a Data Scientist, Educator, and Founder of PyStatR+, a platform advancing open and practical data science education. His work blends analytics, philosophy, and storytelling to make complex ideas human and empowering. Knowledge is freedom. Data is truth's language — ethics and transparency, its grammar.


Editor’s Note

This tutorial focuses on data wrangling and transformation with Polars, covering essential methods for real-world data analysis. The emphasis on comparing methods (select() vs with_columns(), renaming approaches) reflects our belief that understanding when to use each tool is as important as knowing how to use it. Visualization and tabulation are covered in dedicated companion tutorials.


Acknowledgements

This lesson is part of the broader PyStatR+ Learning Platform, developed with gratitude to mentors, learners, and the open-source community. Special thanks to Ritchie Vink and the Polars contributors for creating such an exceptional tool.

