---
title: "Data Wrangling & Transformation: South Sudan 2008 Census"
subtitle: "A Beginner's Guide to Data Wrangling with Polars"
author: "Alierwai Reng"
date: today
categories: [Python, Polars, Data Wrangling, Demographics, Census]
tags: [python, polars, data-wrangling, data-transformation, census-data, beginners]
image: featured.png
description: "Master essential data wrangling techniques using South Sudan's census data with blazingly fast Polars DataFrames. A beginner-friendly tutorial covering select(), filter(), sort(), group_by(), string methods, and more."
format:
  html:
    code-fold: false
    code-tools: true
    toc: true
    toc-depth: 3
    toc-title: "Tutorial Contents"
execute:
  warning: false
  message: false
---
## Introduction {#sec-intro}
Welcome to this hands-on data analysis tutorial! This guide showcases **Polars**—the blazingly fast DataFrame library—with essential techniques for real-world data analysis.
By the end of this guide, you'll understand how to:
- **Master essential methods** like `select()`, `filter()`, `with_columns()`, and `group_by()`
- **Load and explore** real-world census data with schema control
- **Transform string data** using Polars' powerful `.str` accessor methods
- **Calculate** summary statistics using `group_by()` and `agg()`
We will analyze South Sudan's 2008 census data as a practical case study; however, the analytical techniques and workflows you will learn are fully transferable to any dataset across domains and contexts.
**Data Source:**
National Bureau of Statistics, South Sudan, via the Open Data for Africa platform:
[Population by Age and Sex (2008)](http://southsudan.opendataforafrica.org/fvjqdpe/population-by-age-and-sex-2008-south-sudan)
::: {.callout-note}
## Tested With
polars 1.37.1, Python 3.14.0
:::
---
## Part 1: Environment Setup {#sec-setup}
### Step 1: Import Required Libraries
Every Python analysis starts with importing the tools we need. Think of libraries as specialized toolboxes—each designed for specific tasks.
```{python}
#| label: load-packages
# System information
import sys
# Data manipulation
import polars as pl # The blazingly fast DataFrame library
import polars.selectors as cs # Column selectors for easy column operations
# Confirmation
print("✅ All libraries loaded successfully!")
print(f"📦 Polars version: {pl.__version__}")
print(f"🐍 Python version: {sys.version.split()[0]}")
```
::: {.callout-note}
## Package Installation
If you don't have Polars installed, run **one** of the following in your terminal (they are alternatives, not a sequence):
```bash
uv add "polars>=1.37.1"
# or, with pip (the quotes keep the shell from treating ">" as a redirect):
pip install "polars>=1.37.1"
```
**Library Purposes:**
| Library | What It Does |
|---------|--------------|
| **polars** | Fast, memory-efficient data manipulation with Rust backend |
| **polars.selectors** | Easy column selection (similar to tidyselect in R) |
:::
---
## Part 2: Essential Polars Methods {#sec-essential}
Let's master the core Polars methods you'll use in every analysis.
### Step 2: `select()` vs `with_columns()`
`select()` returns only specified columns; `with_columns()` keeps all columns and adds/modifies new ones.
```{python}
#| label: select-vs-with-columns
demo_df = pl.DataFrame({
"name": ["Alek", "Bol", "Chol"],
"age": [25, 30, 35],
"salary": [50000, 60000, 75000]
})
selected = demo_df.select("name", "age")
print(selected)
with_new = demo_df.with_columns(
(pl.col("salary") * 1.1).alias("new_salary")
)
print(with_new)
```
### Step 3: `filter()` — Subsetting Rows
```{python}
#| label: filter-demo
high_earners = demo_df.filter(pl.col("salary") > 55000)
print(high_earners)
filtered = demo_df.filter(
(pl.col("age") >= 30) & (pl.col("salary") > 50000)
)
print(filtered)
```
### Step 4: `sort()` — Ordering Data
```{python}
#| label: sort-demo
sorted_asc = demo_df.sort("salary")
print(sorted_asc)
sorted_desc = demo_df.sort("salary", descending=True)
print(sorted_desc)
```
### Step 5: Column Selectors (`cs`)
Use selectors to pick columns by type: `cs.numeric()`, `cs.string()`, `cs.boolean()`, `cs.temporal()`, or by pattern with `cs.contains()` and `cs.starts_with()`.
```{python}
#| label: selectors-demo
mixed_df = pl.DataFrame({
"name": ["Alek", "Bol"],
"age": [25, 30],
"salary": [50000.0, 60000.0],
"is_manager": [True, False]
})
numeric_cols = mixed_df.select(cs.numeric())
string_cols = mixed_df.select(cs.string())
print(numeric_cols)
print(string_cols)
```
---
## Part 3: Loading the Census Data {#sec-load}
Now let's apply these concepts to real census data!
### Step 6: Load Census Data from URL
```{python}
#| label: load-data
# Define the data source URL
data_url = (
"https://raw.githubusercontent.com/tongakuot/r_tutorials/refs/heads/main/00-input/ss_2008_census_data_raw.csv"
)
# Load the data into a DataFrame
# schema_overrides forces the "2008" column to be read as string (Utf8) instead of integer
# null_values specifies which values should be treated as missing data
df_raw = pl.read_csv(
data_url,
schema_overrides={"2008": pl.Utf8},
null_values=["", "NA", "N/A", "null"]
)
# Confirm successful load
print("✅ Data loaded successfully!")
print(f"📋 Rows: {df_raw.shape[0]:,}")
print(f"📋 Columns: {df_raw.shape[1]}")
print(f"\n📝 Schema: {df_raw.schema}")
```
### Step 7: Explore Data Structure
```{python}
#| label: explore-data
# Preview the data
print("🔍 First 5 rows:")
print(df_raw.head())
# Check for missing values
print("\n❓ Missing values per column:")
print(df_raw.null_count())
# Unique values in key columns
print("\n🗺️ Unique states:")
print(df_raw.get_column("Region Name").unique().sort())
```
---
## Part 4: Data Cleaning with Polars {#sec-clean}
### Step 8: Rename Columns
Use a dictionary for specific columns or a lambda for systematic transformations:
```{python}
#| label: rename-dict
# Option 1: rename specific columns with a dict
# (shown for illustration only -- the lambda version below overwrites it,
#  and it is the snake_case names that the rest of the tutorial relies on)
df = df_raw.rename({
    "Region Name": "state",
    "Variable Name": "gender_raw",
    "Age Name": "age_category",
    "2008": "population"
})
# Option 2: rename every column systematically with a lambda
df = df_raw.rename(
    lambda col: col.lower().replace(" ", "_")
)
print(f"Columns: {df.columns}")
```
### Step 9: String Cleaning with `.str` Methods
```{python}
#| label: string-cleaning
string_columns = [col for col, dtype in df.schema.items() if dtype == pl.Utf8]
df_cleaned = df.with_columns([
pl.col(col).str.strip_chars().str.to_titlecase()
for col in string_columns
])
print(df_cleaned.head())
```
### Step 10: Extract Gender Information
The `variable_name` column contains structured text like "Population, Male (Number)". Let's extract the gender:
```{python}
#| label: extract-gender
# First, see what's in the column
print("🔍 Unique values in variable_name:")
print(df_cleaned["variable_name"].unique())
# Extract gender using string methods
census_df = (
df_cleaned
.with_columns([
pl.col("variable_name")
.str.split(" ")
.list.get(1) # Get second word
.str.strip_chars()
.alias("gender")
])
.rename({"age_name": "age_category", "region_name": "state"})
)
print("\n✅ Gender extracted!")
print(census_df.select(["variable_name", "gender"]).unique())
```
### Step 11: Clean Age Categories with `when().then().otherwise()`
```{python}
#| label: clean-age-categories
# First, convert population to numeric
# (strip thousands separators first, in case values like "1,234" appear --
#  casting them directly with strict=False would silently produce nulls)
census_df = census_df.with_columns(
    pl.col("2008")
    .str.replace_all(",", "")
    .cast(pl.Int64, strict=False)
    .alias("population")
)
# Standardize age categories
census_clean = census_df.with_columns(
age_category=pl.when(pl.col("age_category").is_in(["0 To 4", "5 To 9", "10 To 14"]))
.then(pl.lit("0-14 (Children)"))
.when(pl.col("age_category").is_in(["15 To 19", "20 To 24"]))
.then(pl.lit("15-24 (Youth)"))
.when(pl.col("age_category").is_in(["25 To 29", "30 To 34", "35 To 39", "40 To 44"]))
.then(pl.lit("25-44 (Adults)"))
.when(pl.col("age_category").is_in(["45 To 49", "50 To 54", "55 To 59", "60 To 64"]))
.then(pl.lit("45-64 (Middle Age)"))
.when(pl.col("age_category") == "65+")
.then(pl.lit("65+ (Seniors)"))
.otherwise(pl.col("age_category"))
)
print("✅ Age categories standardized!")
print(census_clean["age_category"].unique().sort())
```
### Step 12: Prepare Final Dataset
```{python}
#| label: prepare-final
# Select and filter final columns
census = (
census_clean
.select("state", "gender", "age_category", "population")
.filter(
(pl.col("gender") != "Total") &
(pl.col("age_category") != "Total")
)
)
print("✅ Final dataset ready!")
print(f"📋 Shape: {census.shape[0]:,} rows × {census.shape[1]} columns")
print(census.head())
```
---
## Part 5: Analysis & Summary Statistics {#sec-analysis}
Now let's apply our cleaned data to extract meaningful insights!
### Step 13: Gender Distribution Analysis
```{python}
#| label: gender-analysis
# Calculate gender distribution
gender_summary = (
census
.group_by("gender")
.agg(pl.col("population").sum().alias("total_population"))
.with_columns(
(pl.col("total_population") / pl.col("total_population").sum() * 100)
.round(2)
.alias("percentage")
)
.sort("total_population", descending=True)
)
print("👥 Gender Distribution:")
print(gender_summary)
```
### Step 14: Age Distribution Analysis
```{python}
#| label: age-analysis
# Calculate age distribution
age_summary = (
census
.group_by("age_category")
.agg(pl.col("population").sum().alias("total_population"))
.with_columns(
(pl.col("total_population") / pl.col("total_population").sum() * 100)
.round(2)
.alias("percentage")
)
.sort("total_population", descending=True)
)
print("📊 Age Distribution:")
print(age_summary)
```
---
## Key Insights {#sec-insights}
::: {.callout-important icon="true"}
## What the Data Tells Us
### 1. **Population Concentration**
The top states account for a significant portion of the national population. This geographic concentration has direct implications for infrastructure investment and resource allocation.
### 2. **Youth Demographics**
The age distribution reveals a predominantly young population—typical of developing nations. The 0-14 and 15-24 age groups represent the largest segments, indicating a "youth bulge" with both opportunities and challenges.
### 3. **Gender Balance**
Most states show relatively balanced gender distributions, with minor variations that may reflect migration patterns or data collection methodology differences.
### 4. **A Small Senior Cohort**
Seniors (65+) consistently represent the smallest age group across states, further underscoring the youthful population structure.
:::
---
## Conclusion {#sec-conclusion}
Congratulations! You've mastered essential Polars methods for data wrangling and transformation!
::: {.callout-tip icon="true"}
## What You've Learned
**Essential Polars Methods:**

- ✅ `select()` vs `with_columns()` — when to use each
- ✅ `filter()` for row subsetting with conditions
- ✅ `sort()` for ordering data
- ✅ Column selectors (`cs`) for type-based selection
- ✅ `group_by().agg()` for aggregations

**Data Cleaning & Transformation:**

- ✅ Renaming columns (dict or lambda)
- ✅ String methods (strip, titlecase, split, extract)
- ✅ Conditional logic with `when().then().otherwise()`
- ✅ Type casting with `.cast()`
:::
::: {.callout-note}
## Next Steps for Learning
**Beginner:**
1. Practice `group_by()` and `agg()` on your own data
2. Experiment with different column selectors
3. Try more complex filter conditions
**Intermediate:**
4. Learn the lazy API with `scan_csv()` for larger datasets
5. Explore window functions for advanced analytics
6. Study the streaming API for huge files
**Advanced:**
7. Build data pipelines with Polars' lazy evaluation
8. Compare performance benchmarks with pandas
9. Explore Polars' Rust API for maximum performance
**Resources:**
- [Polars User Guide](https://docs.pola.rs/) — Official documentation
- [Polars Changelog](https://docs.pola.rs/releases/changelog/) — Latest features
:::
---
```{=html}
<!-- Author Card: Alier Reng -->
<hr class="author-section-divider">
<div class="author-card">
<img src="/images/blog/alier-reng-founder.png"
alt="Alier Reng"
class="author-card-photo">
<div class="author-card-info">
<h3>Alier Reng</h3>
<div class="author-card-role">Founder, Lead Educator & Creative Director at PyStatR+</div>
<p class="author-card-bio">
Alier Reng is a Data Scientist, Educator, and Founder of PyStatR+, a platform advancing open and practical data science education. His work blends analytics, philosophy, and storytelling to make complex ideas human and empowering. Knowledge is freedom. Data is truth's language — ethics and transparency, its grammar.
</p>
<div class="author-card-social">
<a href="https://www.pystatrplus.org" title="PyStatR+" aria-label="PyStatR+ Website" class="social-website">
<svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-1 17.93c-3.95-.49-7-3.85-7-7.93 0-.62.08-1.21.21-1.79L9 15v1c0 1.1.9 2 2 2v1.93zm6.9-2.54c-.26-.81-1-1.39-1.9-1.39h-1v-3c0-.55-.45-1-1-1H8v-2h2c.55 0 1-.45 1-1V7h2c1.1 0 2-.9 2-2v-.41c2.93 1.19 5 4.06 5 7.41 0 2.08-.8 3.97-2.1 5.39z"/></svg>
<span>Website</span>
</a>
<a href="https://github.com/Alierwai" title="GitHub" aria-label="GitHub" class="social-github">
<svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 2A10 10 0 0 0 2 12c0 4.42 2.87 8.17 6.84 9.5.5.08.66-.23.66-.5v-1.69c-2.77.6-3.36-1.34-3.36-1.34-.46-1.16-1.11-1.47-1.11-1.47-.91-.62.07-.6.07-.6 1 .07 1.53 1.03 1.53 1.03.87 1.52 2.34 1.07 2.91.83.09-.65.35-1.09.63-1.34-2.22-.25-4.55-1.11-4.55-4.92 0-1.11.38-2 1.03-2.71-.1-.25-.45-1.29.1-2.64 0 0 .84-.27 2.75 1.02.79-.22 1.65-.33 2.5-.33.85 0 1.71.11 2.5.33 1.91-1.29 2.75-1.02 2.75-1.02.55 1.35.2 2.39.1 2.64.65.71 1.03 1.6 1.03 2.71 0 3.82-2.34 4.66-4.57 4.91.36.31.69.92.69 1.85V21c0 .27.16.59.67.5C19.14 20.16 22 16.42 22 12A10 10 0 0 0 12 2z"/></svg>
<span>GitHub</span>
</a>
<a href="https://www.linkedin.com/in/alierreng" title="LinkedIn" aria-label="LinkedIn" class="social-linkedin">
<svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M20.45 20.45h-3.56v-5.57c0-1.33-.02-3.04-1.85-3.04-1.85 0-2.14 1.45-2.14 2.94v5.67H9.34V9h3.41v1.56h.05c.48-.9 1.64-1.85 3.37-1.85 3.6 0 4.27 2.37 4.27 5.46v6.28zM5.34 7.43a2.06 2.06 0 1 1 0-4.12 2.06 2.06 0 0 1 0 4.12zM7.12 20.45H3.56V9h3.56v11.45z"/></svg>
<span>LinkedIn</span>
</a>
<a href="https://youtube.com/@PyStatRPlus" title="YouTube" aria-label="YouTube" class="social-youtube">
<svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M23.498 6.186a3.016 3.016 0 0 0-2.122-2.136C19.505 3.545 12 3.545 12 3.545s-7.505 0-9.377.505A3.017 3.017 0 0 0 .502 6.186C0 8.07 0 12 0 12s0 3.93.502 5.814a3.016 3.016 0 0 0 2.122 2.136c1.871.505 9.376.505 9.376.505s7.505 0 9.377-.505a3.015 3.015 0 0 0 2.122-2.136C24 15.93 24 12 24 12s0-3.93-.502-5.814zM9.545 15.568V8.432L15.818 12l-6.273 3.568z"/></svg>
<span>YouTube</span>
</a>
</div>
</div>
</div>
```
---
## Editor's Note
This tutorial focuses on data wrangling and transformation with **Polars**, covering essential methods for real-world data analysis. The emphasis on comparing methods (`select()` vs `with_columns()`, renaming approaches) reflects our belief that understanding *when* to use each tool is as important as knowing *how* to use it. Visualization and tabulation are covered in dedicated companion tutorials.
---
## Acknowledgements
This lesson is part of the broader **PyStatR+ Learning Platform**, developed with gratitude to mentors, learners, and the open-source community. Special thanks to Ritchie Vink and the Polars contributors for creating such an exceptional tool.
---
## References
- [Polars User Guide](https://docs.pola.rs/) — Official documentation
- [Polars Releases](https://github.com/pola-rs/polars/releases) — GitHub releases
- [Polars Changelog](https://docs.pola.rs/releases/changelog/) — Detailed changelog
- [South Sudan National Bureau of Statistics](http://southsudan.opendataforafrica.org/) — Data source
---
**PyStatR+** — *Learning Simplified. Communication Amplified.* 🚀
**Join the Conversation:** share your thoughts, ask questions, or contribute insights.