SQL Data Manipulation Guide

Comprehensive guide to manipulating and managing data using SQL, including joins, aggregations, date functions, and data cleaning techniques.

Tags: SQL, Database Management, Data Analysis

Author: Habtamu Bizuayehu

Published: October 1, 2025

🧰 Objective and Contents of the Presentation

This presentation provides a focused, end-to-end path for working with PostgreSQL on Windows—from installation to analysis and reporting. It begins by installing PostgreSQL with the EnterpriseDB bundle, registering the server in pgAdmin, and using StackBuilder for optional tools. It then shows how to create a new database, import data (e.g., .csv), and perform core SQL data manipulation (filtering, grouping, joins, window functions). Finally, it demonstrates how to connect SQL with R via Quarto/R Markdown to generate reproducible reports in HTML/PDF that combine narrative, code, and results.


🧭 Presentation Roadmap

The flow is intentional: begin with foundations (Part I), connect R/Quarto for reproducible outputs (Part II), then load and explore data (Part III). After that, move through core and advanced wrangling (Parts IV–V), data quality and types (Part VI), descriptive statistics (Part VII), performance tuning (Part VIII), and finally security, backup, and sharing (Part IX).

Part I — Introduction and Installing SQL

  • What is SQL & why it matters (ACID, integrity, performance)
  • PostgreSQL overview & tooling (pgAdmin, psql)

Part II — Connecting with R, R Markdown, Quarto

  • Connecting from R/Quarto; running and saving scripts; comments; shortcuts

Part III — Data Loading & Exploration

  • Table creation (DDL) & data types
  • CSV import: COPY / \copy and permissions
  • Basic exploration: previews, counts, distincts, NULL checks

Part IV — Wrangling Core (SELECT / WHERE / CASE / JOIN)

  • Selecting & aliasing; computed columns
  • Filtering, sorting, limiting
  • Conditional logic with CASE, COALESCE, NULLIF
  • Joins (INNER/LEFT/RIGHT/FULL) and practical patterns
  • Appending data

Part V — Advanced Wrangling

  • CTEs (WITH) for readable pipelines
  • Window functions (ranking, partitions, moving calculations)
  • Deduplication with ROW_NUMBER(); upserts overview

Part VI — Data Quality, Types & Dates

  • Missing data strategies (fill, filter, profile)
  • Casting & cleaning (regex, numeric/text conversions)
  • Date/time basics and analysis

Part VII — Descriptive Statistics & Tabulation

  • Aggregations & GROUP BY; HAVING filters
  • Ordered-set aggregates (percentiles)
  • Crosstab/pivot with tablefunc extension

Part VIII — Performance Basics

  • Indexing strategies (B-tree on filters/group keys)
  • Reading query plans with EXPLAIN ANALYZE
  • Simple tuning tips (predicates, projections, ordering)

Part IX — Security, Backup & Sharing

  • Roles & privileges (grant read-only to learners)
  • Backups: pgAdmin / pg_dump & restores
  • Exporting results (CSV/HTML) & reporting with Quarto

Part I — Introduction: Installing, RDBMS, operations and toolkit

This section sets the context: what SQL is, when a relational database is the right tool, and how ACID properties {Atomicity (all-or-nothing transactions), Consistency (always obeys constraints), Isolation (concurrent work behaves as if run alone), Durability (commits persist even after failures)} and well-designed schemas support reliable analytics at scale. It briefly compares mainstream RDBMS options. Then I walk through a clean Windows install using the EnterpriseDB bundle, registering the server in pgAdmin and verifying connectivity before any data work. In short, the hands-on toolkit centers on SELECT … FROM with WHERE filters, ORDER BY/LIMIT shaping, and DISTINCT/aliases; aggregations (COUNT/SUM/AVG) with GROUP BY/HAVING; joins to combine tables; and a few essentials — COALESCE/CASE, type casts (::), and common string/date helpers.
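As a taste of that toolkit, here is a minimal sketch. It assumes the global health statistics table that is created later in this guide, with country, year_event, and dalys columns; treat it as illustrative rather than project output.

```sql
-- Sketch: filter, aggregate, and shape output in one query.
SELECT
    country,
    COUNT(*)             AS n_rows,
    ROUND(AVG(dalys), 1) AS avg_dalys
FROM public.global_health_statistics
WHERE year_event >= 2015           -- WHERE filters rows
GROUP BY country                   -- GROUP BY defines groups
HAVING COUNT(*) > 100              -- HAVING filters aggregated groups
ORDER BY avg_dalys DESC            -- ORDER BY shapes the output
LIMIT 10;                          -- LIMIT caps the rows returned
```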

🧭 Relational Database Management Systems (RDBMS) Landscape

Relational Database Management Systems (RDBMS) are the engines that store and query structured, table-shaped data using SQL. They matter because they enforce data integrity (ACID transactions), power reporting and analytics, and keep apps reliable under load. Your choice of RDBMS impacts performance, features you can lean on, skills availability in the team, and ongoing licensing/hosting costs.

  • PostgreSQL — Open-source, strong data types, constraints, JSON, CTEs (Common Table Expressions), window functions, and extensions (e.g., PostGIS). Excellent for analytics + app backends.
  • MySQL / MariaDB — Common in web stacks and hosting; straightforward and lightweight. Great for websites and SaaS backends.
  • Microsoft SQL Server — Popular in enterprise settings (Azure/Windows shops). Tight integration with Power BI and the MS ecosystem.
  • Oracle Database — Feature-rich enterprise workhorse used by government/finance; heavy on licensing and administration.
  • Managed Cloud (DBaaS) — AWS RDS / Azure / Google Cloud SQL hosting PostgreSQL/MySQL/SQL Server with patching, backups, and scaling handled for you.

This project's choice — PostgreSQL:
I used PostgreSQL for this project because it offers modern SQL features without licence fees, runs well on Windows and all major clouds, and scales from local development to production. It has great community support and extensions when you need them.

🔗 Downloading and Installing SQL

Download the PostgreSQL installer from:

👉 https://www.enterprisedb.com/downloads/postgres-postgresql-downloads

Select the Windows version (e.g., x86-64). This installer is bundled with:

  • PostgreSQL Server
  • pgAdmin
  • StackBuilder
  • Command-line tools like psql

Bundling everything in one installer saves setup headaches later on.


📥 Installation Steps (Windows)

  1. Run the Installer
    Double-click the .exe file to launch the setup wizard.

  2. Select Components
    Leave all defaults selected:

    • ✅ PostgreSQL Server
    • ✅ pgAdmin
    • ✅ Command Line Tools
    • 🔄 StackBuilder (highly recommended)
  3. Choose Installation Directory
    Stick with: C:\Program Files\PostgreSQL\<version>, unless you have a good reason to move it.

  4. Set a Password for postgres
    This password secures the default admin account.
    💡 Tip: store it safely — you’ll need it to log into pgAdmin later.

  5. Keep the Default Port
    Port 5432 is standard for PostgreSQL. Change it only if you know you have a conflict.

  6. Accept the Locale Settings
    The default system locale works fine for most projects. You can later create databases with a specific locale if required.

  7. Finish the Installation
    After confirming settings, the installer will complete the setup and initialise the database service.


First Launch: Setting Up pgAdmin

Launch pgAdmin from the Start Menu.

  1. Click “Add New Server”

    You’ll be prompted to register your PostgreSQL server so pgAdmin can manage it.

  2. Configure the Connection

    • General tab:
      • Name: Local PostgreSQL
    • Connection tab:
      • Host: localhost
      • Port: 5432
      • Username: postgres
      • Password: (your password from earlier)

    ✅ Check Save Password so you’re not prompted every time.

  3. Click Save

    You’ll now see your server listed in the left pane. Expand it to view databases, schemas, tables, and more.

    • If you see two entries (e.g., Local PostgreSQL and PostgreSQL 17), they’re just pgAdmin shortcuts. Keep the auto-detected one and feel free to delete the duplicate one (e.g., localhost).
    • You’ll reconnect each time you reopen pgAdmin (right-click server → Connect). If the password isn’t remembered, re-enter it and tick Save Password.

Verifying the Connection

  • Right-click the server → Connect
  • Expand Databases > postgres > Schemas > public
  • This confirms your PostgreSQL server is active and accessible.

You’re now ready to manage your databases through the GUI.


🔌 Using StackBuilder to Install Extensions

  • Select your PostgreSQL instance (e.g., PostgreSQL 15 on port 5432)
  • Click Next
  • You’ll see several categories. Here’s a guide:

What is recommended:

  • Add-ons, tools and utilities — Tools like pgAgent, pgBackRest, or adminpack; helpful for backups, scheduling, and server-side utilities.
  • Database Drivers — ODBC, JDBC, and .NET drivers; install if you’re planning to connect to PostgreSQL from BI tools or other languages like Java or R.

Optional:

  • Spatial Extensions — only if you need GIS support; this includes PostGIS.
  • Web Development — only if you’re working on a full PHP/PostgreSQL stack (less common).
  • Database Server — already installed via the initial setup; no need to repeat.
  • Registration-required — mostly trialware; skip unless testing something specific.

Once selected, follow the prompts to download and install the tools/extensions you need.


  • Reconnect behaviour: Each time you reopen pgAdmin, right-click the server → Connect (password won’t be prompted if saved).

If you see fe_sendauth: no password supplied, open the server properties → Connection tab → re-enter password and tick Save Password if required.


SQL Essential Functions and Operations — Overview

This section offers an overview of essential SQL functions and operations, focusing on their use in data manipulation. We’ll explore core concepts that allow you to retrieve, filter, sort, and aggregate data, which are fundamental to working with any dataset. Later, these principles will be demonstrated using the public.global_health_statistics dataset within the SQL_Data_Manipulation database.


  1. Querying Basics — Use SELECT ... FROM ... to retrieve columns and rows. Add WHERE to filter, ORDER BY to sort, LIMIT to cap rows, and DISTINCT to remove duplicates. Column and table aliases keep queries readable.

  2. Operators & Predicates — Core building blocks for conditions: comparison (=, <>, <, <=, >, >=), logical (AND, OR, NOT), membership (IN), ranges (BETWEEN), pattern match (LIKE, ILIKE), null checks (IS NULL, IS NOT NULL), and regex (~, ~*) in PostgreSQL.

  3. Identifiers, Aliases, Schemas — Qualify objects as schema.table (e.g., public.global_health_statistics). Use AS for aliases (e.g., ghs). Strings use single quotes '...'; double quotes preserve case/characters in identifiers; end statements with ;. Comments: -- line and /* block */.

  4. Aggregation & Grouping — Aggregate with COUNT, SUM, AVG, MIN, MAX. Use GROUP BY to define groups and HAVING to filter aggregated results.

  5. Joining Data — Combine tables on keys. Typical joins: INNER (matches), LEFT (preserve left), RIGHT, FULL, and CROSS. Prefer explicit JOIN ... ON ... with clear key conditions.

  6. Subqueries & CTEs — Nest queries inside WHERE/FROM or use CTEs with WITH. CTEs improve readability and can be chained; PostgreSQL also supports WITH RECURSIVE for hierarchical problems.

  7. Set Operations — Stack compatible result sets vertically: UNION (dedup), UNION ALL (no dedup), INTERSECT (overlap), EXCEPT (left minus right).

  8. Window Functions — Compute ranks, running totals, and moving averages without collapsing rows: ... OVER (PARTITION BY ... ORDER BY ...). Common helpers: ROW_NUMBER, RANK, LAG, LEAD, SUM/AVG OVER.

  9. Strings, Numbers, Dates — Strings: UPPER, LOWER, TRIM, LENGTH, REPLACE, SUBSTRING, concatenation ||.
     Numbers: ROUND, CEIL, FLOOR, safe division with NULLIF.
     Dates: NOW, EXTRACT, DATE_TRUNC, MAKE_DATE. Cast types with :: or CAST(...).

  10. Conditional & NULL Handling — CASE builds conditional expressions; COALESCE picks the first non-NULL value; NULLIF(a,b) returns NULL if a=b (useful for safe division).

  11. DDL (Data Definition Language), Structures & Indexes — Define and evolve schema: CREATE/ALTER/DROP TABLE, constraints (PRIMARY KEY, UNIQUE, NOT NULL, CHECK), and CREATE INDEX to speed up common filters/joins.

  12. Transactions & Security — Wrap changes in BEGIN ... COMMIT (or ROLLBACK). Use SAVEPOINT for partial rollbacks. Control access with GRANT / REVOKE.

  13. Loading & Export — Bulk load/export CSV using COPY (server-side) or \copy (psql client-side). On Windows, prefer forward slashes (C:/...), and if permissions fail, move files to C:/temp/.

  14. Introspection & “Help” — Explore metadata via information_schema (tables/columns) or psql meta-commands (\dt, \d ghs, \dn, \l). SHOW search_path; reveals the active schema order.
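To see several of these pieces working together, here is a hedged sketch combining a CTE, aggregation, and a window function against the ghs table (column names as used throughout this guide; the query is illustrative, not project output):

```sql
-- Rank diseases by total DALYs within each country:
-- a CTE for the per-disease totals, then RANK() per partition.
WITH totals AS (
    SELECT country, disease_name, SUM(dalys) AS total_dalys
    FROM public.global_health_statistics
    GROUP BY country, disease_name
)
SELECT country,
       disease_name,
       total_dalys,
       RANK() OVER (PARTITION BY country
                    ORDER BY total_dalys DESC) AS daly_rank
FROM totals
WHERE total_dalys > 0
ORDER BY country, daly_rank;
```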

📋 Summary Table — Operators, Functions & Patterns

Abbreviation used below: ghs = public.global_health_statistics

  • SELECT ... FROM — retrieve columns/rows: SELECT country, year_event FROM ghs;
  • WHERE — filter rows: ... FROM ghs WHERE year_event >= 2018;
  • ORDER BY — sort output: ... ORDER BY incidence DESC;
  • LIMIT / OFFSET — cap rows / paginate: ... ORDER BY dalys DESC LIMIT 10;
  • DISTINCT — remove duplicates: SELECT DISTINCT disease_category FROM ghs;
  • AS (aliases) — rename columns/tables: SELECT dalys AS total_dalys FROM ghs;
  • Comparison ops (=, <>, <, <=, >, >=): WHERE country = 'Australia';
  • Logical ops (AND, OR, NOT): WHERE year_event>=2015 AND dalys>0;
  • Membership — IN (...): WHERE country IN ('Australia','USA');
  • Range — BETWEEN a AND b: WHERE year_event BETWEEN 2015 AND 2024;
  • Pattern match — LIKE, ILIKE: WHERE disease_name ILIKE 'influenza%';
  • Regex (PG) — ~, ~*: WHERE disease_name ~* 'flu|covid';
  • Null checks — IS NULL, IS NOT NULL: WHERE incidence IS NULL;
  • String concat — ||: SELECT country||' - '||disease_name FROM ghs;
  • Casting — ::type or CAST(): SELECT incidence::numeric FROM ghs;
  • Aggregates — COUNT/SUM/AVG/MIN/MAX: SELECT SUM(dalys) FROM ghs;
  • GROUP BY — define groups: ... GROUP BY country, year_event;
  • HAVING — filter groups: ... GROUP BY country HAVING SUM(dalys)>0;
  • JOIN — combine tables (INNER/LEFT/RIGHT/FULL/CROSS): SELECT g.country, r.region FROM ghs g LEFT JOIN country_regions r ON r.country = g.country;
  • Subquery (IN) — filter by another query: WHERE country IN (SELECT country FROM focus_list);
  • CTE (WITH) — name an intermediate result: WITH recent AS (SELECT * FROM ghs WHERE year_event>=2020) SELECT * FROM recent;
  • Set ops — UNION, UNION ALL, INTERSECT, EXCEPT: (SELECT country FROM ghs WHERE dalys>0) INTERSECT (SELECT country FROM ghs WHERE incidence>0);
  • Window OVER(...) — rank/totals per partition: SUM(dalys) OVER (PARTITION BY country);
  • Ranking — ROW_NUMBER, RANK: RANK() OVER (PARTITION BY country ORDER BY SUM(dalys) DESC);
  • LAG/LEAD — prior/next row value: LEAD(incidence) OVER (PARTITION BY country ORDER BY year_event);
  • CASE — conditional logic: CASE WHEN mortality_pct>5 THEN 'High' ELSE 'Low' END;
  • COALESCE / NULLIF — null handling & safe division: COALESCE(incidence,0)/NULLIF(population_affected,0);
  • Strings — UPPER/LOWER/TRIM/LENGTH/REPLACE: SELECT UPPER(country), LENGTH(disease_name) FROM ghs;
  • Dates — NOW/EXTRACT/DATE_TRUNC/MAKE_DATE: DATE_TRUNC('year', MAKE_DATE(year_event,1,1));
  • DML — INSERT/UPDATE/DELETE/MERGE: UPDATE ghs SET incidence=0 WHERE incidence IS NULL;
  • DDL — CREATE/ALTER/DROP TABLE: ALTER TABLE ghs ADD COLUMN notes TEXT;
  • Constraints — PRIMARY KEY/UNIQUE/CHECK/NOT NULL: ALTER TABLE ghs ADD CHECK (mortality_pct BETWEEN 0 AND 100);
  • Indexes — CREATE INDEX: CREATE INDEX ON ghs(country, year_event);
  • Transactions — BEGIN/COMMIT/ROLLBACK/SAVEPOINT: BEGIN; UPDATE ghs SET dalys=0; ROLLBACK;
  • Security — GRANT/REVOKE: GRANT SELECT ON ghs TO analyst_role;
  • Load/Export — COPY / \copy: COPY ghs FROM 'C:/temp/file.csv' WITH (CSV, HEADER);
  • Introspection — information_schema, psql \d: SELECT column_name FROM information_schema.columns WHERE table_name='global_health_statistics';
  • Comments — -- line, /* ... */ block: -- count rows

Notes to Keep Handy

  • Prefer explicit joins with clear keys; avoid ambiguous SELECT * in production code.
  • List columns in COPY for stable mapping if CSV order changes.
  • Use forward slashes on Windows paths with COPY (C:/...).
  • Create composite indexes to match common filters (e.g., (country, year_event)).
  • Always sanity-check results with quick counts/limits before heavy analysis.
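The composite-index advice above can be checked directly with EXPLAIN ANALYZE. The sketch below is illustrative; the index name is a placeholder and actual plans depend on your data and server settings:

```sql
-- Create a composite index matching a common filter pattern,
-- then check whether the planner actually uses it.
CREATE INDEX IF NOT EXISTS idx_ghs_country_year
    ON public.global_health_statistics (country, year_event);

EXPLAIN ANALYZE
SELECT disease_name, SUM(dalys)
FROM public.global_health_statistics
WHERE country = 'Australia'
  AND year_event BETWEEN 2015 AND 2024
GROUP BY disease_name;
```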

Part II — Connecting with R, R Markdown, Quarto

Why connect SQL with R (and publish via .Rmd/.qmd)?

Modern analytics benefits from using SQL + R together:
- SQL pushes heavy filtering/aggregation to the database (fast, scalable, auditable).
- R adds wrangling, visualization, statistics, and reproducible documents.

Why use R Markdown (.Rmd) or Quarto (.qmd):
- Reproducibility: prose + code + outputs in one file; anyone can re-run end to end.
- Transparency: every figure or table is tied to the exact SQL/R code that produced it.
- Shareability: render to HTML/PDF/Word for the web or submission.
- Multi-language: mix R chunks and SQL chunks; SQL runs through a live DBI connection created in R.
- Automation: parameterized reports, scheduled renders, version control.

How SQL chunks work in Quarto/R Markdown
1) Create a DBI connection object in an R chunk (e.g., con <- dbConnect(...)).
2) Use SQL chunks with connection = con to run SQL on that database.
3) Outputs (tables) are printed inline under each SQL chunk.
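Those three steps look roughly like this inside a .qmd/.Rmd file (a minimal sketch; the database name and credentials are placeholders to adapt to your setup):

````markdown
```{r setup}
library(DBI); library(RPostgres)
con <- dbConnect(RPostgres::Postgres(),
                 dbname = "SQL_Data_Manipulation",   # placeholder
                 host = "localhost", port = 5432,
                 user = "postgres",
                 password = Sys.getenv("PGPASSWORD"))
```

```{sql, connection = con}
SELECT COUNT(*) FROM public.global_health_statistics;
```
````

Rendering top-to-bottom ensures con exists before the SQL chunk runs, and the query's result table prints inline under the chunk.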

🔐 Security tip: Avoid committing passwords to public repos.

Execution order & prerequisites:
- Install required R packages (DBI, RSQLite, RPostgres, readr, tidyverse, dbplyr).
- Ensure your database server (PostgreSQL) is running when connecting to it.
- Verify file paths (e.g., C:/temp/Global_Health_Statistics.csv) exist on your machine.
- Render top-to-bottom so the connection objects exist before SQL chunks that use them.

R Markdown & Quarto Options (R + SQL)

This section summarizes chunk options and syntax when building reports that mix R chunks and SQL chunks (e.g., {sql, connection = con}). The goal is to control execution, display, and reproducibility, while running queries directly against a database and printing results inline.

How SQL chunks work (at a glance):
- Open a DBI connection in an R chunk (e.g., con <- dbConnect(...)).
- Run queries in SQL chunks using connection = con.
- Options like echo, eval, include, warning, message, error, and cache behave the same as in R chunks.
- SQL chunk outputs render as tables; you can also capture them into R objects with output.var.

Summary: Quarto/R Markdown Chunk Options (R & SQL)

  • Code chunk (R) — {r, warning=FALSE, message=FALSE}: execute R code; suppress messages/warnings for clean output.
  • Code chunk (SQL) — {sql, connection = con}: execute SQL against a DBI connection object created in an R chunk.
  • Show/hide code — echo: true/false: display the source code (true) or hide it (false).
  • Run/skip code — eval: true/false: execute the chunk, or show the code without running it.
  • Hide code & output — include: false: run side effects (e.g., temp tables) but don’t print code or results.
  • Control results — results: 'markup'|'asis'|'hide': choose how printed results appear.
  • Messages & warnings — message: false, warning: false: suppress messages/warnings in output.
  • Error handling — error: true/false: if true, continue rendering and show the error text.
  • Caching — cache: true: reuse results when inputs are unchanged (be careful if DB data changes).
  • Chunk labels — {r my-label}, {sql my-label, connection = con}: name chunks for organization, cross-refs, and cache keys.
  • Figures (R) — fig.width, fig.height, fig.align: size/alignment for plots created in R chunks.
  • Table printing — YAML df-print: paged: global HTML table style; affects SQL outputs too.
  • Output formats — YAML format: select html, pdf, docx, etc., plus format-specific options.

SQL–specific conveniences

  • Bind a connection — {sql, connection = con}: runs the SQL chunk using the DBI connection defined in an R chunk.
  • Capture result — output.var = 'df': saves the query output into an R object (e.g., df).
  • Show query only — eval: false: display the SQL but do not execute it (useful for teaching/drafts).
  • Show result only — echo: false: execute the query and print just the table (hide the SQL text).
  • Hidden setup — include: false: execute without showing anything (e.g., temp objects).
  • Resilient to errors — error: true: don’t halt the build on a chunk error (prints the error).

Examples

# DBI connection created once and reused by SQL chunks
library(DBI)
library(RPostgres)
con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "SQL_Data_Manipulation",
  host   = "localhost", port = 5432,
  user   = "postgres",  password = Sys.getenv("PGPASSWORD")
)
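Building on the connection above, a SQL chunk can hand its result back to R via output.var. This is a sketch; the chunk name and the top_countries object are illustrative:

````markdown
```{sql, connection = con, output.var = "top_countries"}
SELECT country, SUM(dalys) AS total_dalys
FROM public.global_health_statistics
GROUP BY country
ORDER BY total_dalys DESC
LIMIT 5;
```

```{r}
top_countries   # now an ordinary R data frame for plotting, tables, etc.
```
````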

Abbreviations & acronyms used in this section

  • SQL — Structured Query Language
  • RDBMS — Relational Database Management System
  • ACID — Atomicity, Consistency, Isolation, Durability
  • DBaaS — Database as a Service (e.g., RDS/Azure/GCP)
  • GUI — Graphical User Interface (e.g., pgAdmin)
  • CLI — Command-Line Interface (e.g., psql)
  • CTE — Common Table Expression (WITH, WITH RECURSIVE)
  • DDL — Data Definition Language (CREATE/ALTER/DROP)
  • DML — Data Manipulation Language (SELECT/INSERT/UPDATE/DELETE/MERGE)
  • DCL — Data Control Language (GRANT/REVOKE)
  • PK — Primary Key
  • FK — Foreign Key
  • MV — Materialized View
  • JSON/JSONB — JavaScript Object Notation (text/binary)
  • GIN — Generalized Inverted Index (commonly for jsonb)
  • WAL — Write-Ahead Log (durability in Postgres)
  • QMD/RMD — Quarto/ R Markdown documents

Part III — Data Loading & Exploration

  • Table creation (DDL) & data types
  • CSV import: COPY / \copy and permissions
  • Basic exploration: previews, counts, distincts, NULL checks

Create the Project Database (with Australian locale)

We’ll call the project SQL_Data_Manipulation.

Note: This database is created with a quoted, mixed-case name. In PostgreSQL, database names are case-sensitive when created with quotes, so always use the exact casing SQL_Data_Manipulation in tools/connectors.

Run this from the postgres database (or any admin DB) in pgAdmin Query Tool:

-- Create database with English_Australia.1252 locale (Windows)
-- If it already exists, skip this block.
CREATE DATABASE "SQL_Data_Manipulation"
WITH
OWNER = postgres
ENCODING = 'UTF8'
LC_COLLATE = 'English_Australia.1252'
LC_CTYPE   = 'English_Australia.1252'
LOCALE_PROVIDER = 'libc'
TABLESPACE = pg_default
CONNECTION LIMIT = -1
IS_TEMPLATE = FALSE;

COMMENT ON DATABASE "SQL_Data_Manipulation"
IS 'This is the DB for SQL data manipulation project.';

After creation: In pgAdmin, right-click your server → Refresh, then select database SQL_Data_Manipulation and open Query Tool from that database node so subsequent SQL runs in the right place.

💡 Locale note (Windows): English_Australia.1252 controls collation/ctype (sorting/character classification). If creation fails, verify that locale is installed on the OS.
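You can confirm what the new database actually got by querying the catalog from any Query Tool session (a quick verification sketch):

```sql
-- Inspect the collation/ctype recorded for the new database.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = 'SQL_Data_Manipulation';
```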

Alternatively, via the pgAdmin GUI:

  1. Expand your server → right-click Databases → Create > Database…
  2. Database: SQL_Data_Manipulation → Save.

(Optional) Set this database as the Query Tool target using the dropdown in the toolbar.

🧱 Create the Target Table (in SQL_Data_Manipulation)

-- Ensure you're connected to the "SQL_Data_Manipulation" database before running this.
CREATE TABLE IF NOT EXISTS public.global_health_statistics (
    country TEXT,
    year_event INT,
    disease_name TEXT,
    disease_category TEXT,
    prevalence NUMERIC,
    incidence NUMERIC,
    mortality_pct NUMERIC,
    age_group TEXT,
    gender TEXT,
    population_affected BIGINT,
    healthcare_access TEXT,
    dr_per_1000 NUMERIC,
    hospital_beds_per_1000 NUMERIC,
    treatment_type TEXT,
    average_treatment_cost_usd NUMERIC,
    availability_of_vaccines_treatment TEXT,
    recovery_rate_pct NUMERIC,
    dalys NUMERIC,
    improvement_in_5_years_pct NUMERIC,
    per_capita_income_usd NUMERIC,
    education_index NUMERIC,
    urbanization_rate_pct NUMERIC
);
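After running the DDL, a quick introspection query confirms the table landed with the expected columns and types:

```sql
-- Verify the table structure just created.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name   = 'global_health_statistics'
ORDER BY ordinal_position;
```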

📥 Importing Data — Options and Tips

Importing Data Files into SQL

Before analyzing data in SQL, we need to bring it into a database table.
Unlike R or Python, SQL engines do not read files directly; instead, they rely on database commands (COPY, \copy, BULK INSERT, or external tools) to import files into a table.

Below is a summary of how different file types can be imported into common SQL engines.


  • CSV — COPY or \copy (Postgres), BULK INSERT (SQL Server), .import (SQLite). Databases: PostgreSQL, SQLite, SQL Server. Example: COPY mytable FROM 'file.csv' CSV HEADER;
  • TSV / Text (.tsv) — same as CSV but with the tab delimiter specified. Databases: PostgreSQL, SQLite. Example: COPY mytable FROM 'file.tsv' DELIMITER E'\t' CSV HEADER;
  • Excel (.xls/.xlsx) — convert to CSV first, then import (native SQL cannot parse Excel directly). All databases. Example: save as CSV → COPY into SQL.
  • ZIP archives — extract the file first, then import the CSV. All databases. Example: unzip data.zip → COPY mytable FROM 'file.csv' CSV HEADER;
  • Stata/SPSS (.dta/.sav) — convert to CSV using R/Python/StatTransfer, then load into SQL. All databases. Example: R: write.csv(read_dta("file.dta"), "file.csv") → SQL COPY.
  • Online files — download first (with wget/curl), then import the CSV. All databases. Example: wget URL -O file.csv → SQL COPY.

Two ways to load CSV data:

Option A — pgAdmin GUI (Import/Export)

  • Right-click the table → View/Edit Data → All Rows
  • Click Import/Export (📥)
  • Import tab: choose the file, format CSV, check Header, set delimiter ,
  • Click OK

Option B — SQL Import

There are two SQL variants:

B1. Server-side COPY (runs on the DB server process)

Works in pgAdmin Query Tool.

The server must be able to read the file path. On Windows, the service account often cannot read Desktop/Documents.
  • Fix: move the file to a permissive folder like C:/temp/.
COPY public.global_health_statistics (
  country, year_event, disease_name, disease_category, prevalence, incidence, mortality_pct, 
  age_group, gender, population_affected, healthcare_access, dr_per_1000,
  hospital_beds_per_1000, treatment_type, average_treatment_cost_usd,
  availability_of_vaccines_treatment, recovery_rate_pct, dalys,
  improvement_in_5_years_pct, per_capita_income_usd, education_index, urbanization_rate_pct
)
FROM 'C:/temp/Global_Health_Statistics.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"', ESCAPE '"', ENCODING 'UTF8');

B2. Client-side (psql)

Only in psql (i.e., the SQL Shell), not pgAdmin.

Avoids server-permission issues (can read Desktop paths).

\copy public.global_health_statistics (
  country, year_event, disease_name, disease_category, prevalence, incidence, mortality_pct, 
  age_group, gender, population_affected, healthcare_access, dr_per_1000,
  hospital_beds_per_1000, treatment_type, average_treatment_cost_usd,
  availability_of_vaccines_treatment, recovery_rate_pct, dalys,
  improvement_in_5_years_pct, per_capita_income_usd, education_index, urbanization_rate_pct
)
FROM 'C:/Users/User/Desktop/Materials_ Course and proposals/Course Related/Data_source/Global_Health_Statistics.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"', ESCAPE '"', ENCODING 'UTF8');

Tips (Windows paths)

Prefer forward slashes (C:/…) to avoid escaping backslashes.

If permission denied appears with COPY, move the file to C:/temp/ or use client-side \copy (Option B2) instead.

Method used in this project

Used: Option B1 — server-side COPY from C:/temp/Global_Health_Statistics.csv.

Result: COPY 1000000 → 1,000,000 rows imported successfully.
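The same result can be verified from SQL after the import, with a quick row count and range check:

```sql
-- Sanity-check the import: total rows, a NULL probe, and the year range.
SELECT COUNT(*)                                 AS n_rows,
       COUNT(*) FILTER (WHERE country IS NULL)  AS null_country,
       MIN(year_event)                          AS min_year,
       MAX(year_event)                          AS max_year
FROM public.global_health_statistics;
```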

Where to See the Imported Data (Names & Columns)

In pgAdmin: Servers → Databases → SQL_Data_Manipulation → Schemas → public → Tables → global_health_statistics

Right-click → View/Edit Data → All Rows to preview.

Select the table → Columns tab (right pane) to see column names and types.


1) Importing data into R from a local path

This chunk reads the CSV into R, opens a SQLite database file named SQL_Data_Manipulation, and writes the data as a SQLite table. It’s a lightweight, local option for quick iteration (no server required).

# Clear environment and console
rm(list = ls())  # remove all objects in memory
cat("\014")      # clear the console (like wiping a whiteboard)
library(DBI)
library(RSQLite)
library(readr)
library(tidyverse)
library(dbplyr)

# Read CSV
global_health_statistics <- read_csv("C:/temp/Global_Health_Statistics.csv")

# Connect to SQLite
con <- dbConnect(RSQLite::SQLite(), dbname = "SQL_Data_Manipulation")

# Write table to DB
dbWriteTable(con, "global_health_statistics", global_health_statistics, overwrite = TRUE)

# knitr::opts_chunk$set(connection = "con")

Use SQL code from here onwards

The next SQL chunk runs against the current con connection (SQLite here) and previews rows.

SELECT * FROM global_health_statistics LIMIT 10;
Displaying records 1 - 10
country year_event disease_name disease_category prevalence incidence mortality_pct age_group gender population_affected healthcare_access dr_per_1000 hospital_beds_per_1000 treatment_type average_treatment_cost_usd availability_of_vaccines_treatment recovery_rate_pct dalys improvement_in_5_years_pct per_capita_income_usd education_index urbanization_rate_pct
Italy 2013 Malaria Respiratory 0.95 1.55 8.42 0-18 Male 471007 57.74 3.34 7.58 Medication 21064 No 91.82 4493 2.16 16886 0.79 86.02
France 2002 Ebola Parasitic 12.46 8.63 8.75 61+ Male 634318 89.21 1.33 5.11 Surgery 47851 Yes 76.65 2366 4.82 80639 0.74 45.52
Turkey 2015 COVID-19 Genetic 0.91 2.35 6.22 36-60 Male 154878 56.41 4.07 3.49 Vaccination 27834 Yes 98.55 41 5.81 12245 0.41 40.20
Indonesia 2011 Parkinson’s Disease Autoimmune 4.68 6.29 3.99 0-18 Other 446224 85.20 3.18 8.44 Surgery 144 Yes 67.35 3201 2.22 49336 0.49 58.47
Italy 2013 Tuberculosis Genetic 0.83 13.59 7.01 61+ Male 472908 67.00 4.61 5.90 Medication 8908 Yes 50.06 2832 6.93 47701 0.50 48.14
Saudi Arabia 2011 Dengue Bacterial 10.99 6.49 4.64 61+ Female 479234 98.41 3.50 0.62 Therapy 42671 Yes 93.17 416 9.83 29597 0.46 56.50
USA 2013 Malaria Cardiovascular 18.42 6.33 9.33 61+ Female 289931 90.03 3.16 3.31 Surgery 15579 No 92.80 4535 0.89 60027 0.70 20.48
Nigeria 2007 Tuberculosis Neurological 3.48 5.71 1.21 0-18 Female 393296 75.60 0.54 3.54 Medication 15744 Yes 65.45 4584 9.81 23222 0.46 66.49
Italy 2000 Rabies Chronic 15.59 4.74 6.38 19-35 Female 253110 87.87 4.56 5.84 Therapy 7669 Yes 59.23 2253 9.92 30849 0.55 41.27
Australia 2006 Cholera Chronic 10.12 2.08 6.00 61+ Male 174143 95.90 4.63 6.01 Medication 9468 Yes 93.21 4694 2.96 68856 0.90 83.30
SELECT 
  COUNT(*) FILTER (WHERE country IS NULL)       AS null_country,
  COUNT(*) FILTER (WHERE disease_name IS NULL)  AS null_disease
FROM global_health_statistics;
1 records
null_country null_disease
0 0

2) Reading data stored in PostgreSQL via R

This section opens a PostgreSQL connection using RPostgres to the database “SQL_Data_Manipulation”. Best practice is to keep credentials out of code (use env vars or keychain).

# install.packages(c("DBI","RPostgres","dplyr"))  # run once
library(DBI)
library(RPostgres)
library(dplyr)

# (Optional) keep credentials out of the script: set PGUSER/PGPASSWORD as
# environment variables (e.g., in .Renviron) rather than hard-coding them here.
# Sys.setenv(PGUSER = "postgres", PGPASSWORD = "yourpassword")

con <- dbConnect(
  RPostgres::Postgres(),
  host   = "localhost",
  port   = 5432,
  dbname = "SQL_Data_Manipulation",   # the database created earlier
  user   = Sys.getenv("PGUSER", "postgres"),
  password = Sys.getenv("PGPASSWORD", "yourpassword")
)

Pull the table from PostgreSQL into an R data frame (or use dbGetQuery to run a SQL SELECT).

global_health_statistics <- dbReadTable(con, Id(schema = "public", table = "global_health_statistics"))
# or: ghs <- dbGetQuery(con, "SELECT * FROM public.global_health_statistics;")

Reading and Importing Data

In SQL, importing means loading external files into a database table so they can be queried.
Unlike R, where we use functions like read_csv() or read_excel(), in SQL we rely on commands such as COPY, LOAD DATA INFILE, or BULK INSERT depending on the database system.

📊 Reading Common Data Types in SQL

Data Type SQL Command / Method Notes Example
CSV COPY (Postgres), LOAD DATA INFILE (MySQL), BULK INSERT (SQL Server) Most common tabular format COPY patients FROM '/path/patients.csv' CSV HEADER;
Excel (.xls/.xlsx) Not directly supported Convert to CSV or import via ETL tools
Tab-separated (.tsv) Same as CSV with delimiter \t COPY patients FROM '/path/file.tsv' DELIMITER E'\t' CSV HEADER;
Text (.txt) Load as delimited file Need to define delimiter/schema
Stata (.dta), SPSS (.sav), RDS Not supported natively Convert to CSV first (R/Python/SAS)
ZIP archives SQL cannot unzip Unzip externally, then import CSV
Online CSV/Excel Some cloud DBs (BigQuery, Snowflake) allow URL loading Otherwise: download first LOAD DATA ... FROM 'https://...'
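As a concrete PostgreSQL sketch of the CSV row in the table above (the file path is illustrative; adjust it to your environment):

```sql
-- Server-side COPY: the file must be readable by the PostgreSQL service account
COPY public.global_health_statistics
FROM 'C:/data/global_health_statistics.csv'
WITH (FORMAT csv, HEADER true);

-- Client-side alternative in psql: \copy reads the file with *your* permissions
-- \copy public.global_health_statistics FROM 'C:/data/global_health_statistics.csv' CSV HEADER
```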

From here onward, the examples run SQL queries directly against PostgreSQL.

This demonstration shows how I use SQL to explore and validate a large dataset on global health statistics. Each section introduces the query with a short explanation, followed by the SQL code and notes about the expected output.

Data Exploration

Before analysis, it is important to preview the dataset to confirm its structure and column names.
The following query returns the first 10 rows with every column, including country, year_event, disease_name, prevalence, incidence, mortality_pct, and the socio-economic indicators.

-- check some Sample rows
SELECT *
FROM public.global_health_statistics
LIMIT 10;
Displaying records 1 - 10
country year_event disease_name disease_category prevalence incidence mortality_pct age_group gender population_affected healthcare_access dr_per_1000 hospital_beds_per_1000 treatment_type average_treatment_cost_usd availability_of_vaccines_treatment recovery_rate_pct dalys improvement_in_5_years_pct per_capita_income_usd education_index urbanization_rate_pct
China 2015 Hepatitis Genetic 19.87 3.07 8.54 0-18 Male 979238 71.63 1.71 1.16 Therapy 33951 Yes 51.60 980 1.62 74848 0.84 51.33
Australia 2003 Asthma Bacterial 16.47 5.11 7.73 0-18 Female 983250 58.48 4.22 1.08 Therapy 18945 Yes 74.70 1794 7.15 59994 0.54 44.09
Germany 2003 Dengue Neurological 0.50 3.89 9.19 19-35 Male 995467 53.86 2.98 8.98 Therapy 16993 No 69.58 2569 0.05 16782 0.40 52.60
China 2002 Malaria Autoimmune 18.38 3.81 6.77 19-35 Female 990949 62.36 1.66 9.68 Therapy 17764 Yes 97.23 3047 5.84 89944 0.81 21.99
Russia 2007 COVID-19 Bacterial 15.03 12.88 3.54 61+ Other 996438 88.06 4.02 2.22 Vaccination 29279 No 56.19 4102 9.71 43114 0.49 57.31
Russia 2012 Polio Bacterial 16.17 9.98 3.28 19-35 Other 998885 80.62 3.92 3.00 Vaccination 34751 Yes 89.90 1876 8.80 40763 0.45 54.51
South Africa 2000 Hypertension Neurological 3.53 5.05 5.89 19-35 Male 977922 62.92 3.59 9.15 Therapy 44512 Yes 92.83 1045 5.52 57064 0.50 50.14
South Korea 2009 Cholera Bacterial 17.03 14.88 5.54 36-60 Male 993314 96.23 2.55 6.10 Therapy 25692 Yes 73.43 180 7.13 60687 0.53 30.15
Turkey 2017 Leprosy Bacterial 18.75 8.30 1.51 61+ Other 972077 75.82 4.16 8.22 Medication 2330 Yes 58.85 1940 8.76 96878 0.68 62.72
Indonesia 2019 Dengue Respiratory 15.21 3.26 9.92 61+ Male 991463 69.99 1.58 1.45 Therapy 23547 Yes 50.35 994 2.74 71648 0.60 32.33

Confirm how many rows exist in the table: in this case, 10,000 rows.

-- Total rows loaded
SELECT COUNT(*) AS rows_loaded
FROM public.global_health_statistics;
1 records
rows_loaded
10000

Check the earliest and latest years in the dataset: spans from 2000 to 2024.

SELECT MIN(year_event) AS min_year, MAX(year_event) AS max_year
FROM public.global_health_statistics;
1 records
min_year max_year
2000 2024
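Beyond the minimum and maximum, a per-year count confirms there are no gaps in temporal coverage. A quick sketch:

```sql
-- Rows per year: expect every year from 2000 to 2024 to appear
SELECT year_event, COUNT(*) AS n_rows
FROM public.global_health_statistics
GROUP BY year_event
ORDER BY year_event;
```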

Look at the most recent events by sorting years in descending order:

SELECT country, year_event, incidence
FROM public.global_health_statistics
ORDER BY year_event DESC
LIMIT 10;
Displaying records 1 - 10
country year_event incidence
Indonesia 2024 9.01
Germany 2024 11.98
France 2024 12.53
Canada 2024 10.95
Nigeria 2024 5.49
Brazil 2024 7.82
Russia 2024 4.51
Japan 2024 3.38
UK 2024 10.58
Saudi Arabia 2024 4.10

Checking Data Completeness

Check for missing values at the PostgreSQL level: Both null_country and null_disease return 0.

SELECT 
  COUNT(*) FILTER (WHERE country IS NULL)       AS null_country,
  COUNT(*) FILTER (WHERE disease_name IS NULL)  AS null_disease
FROM public.global_health_statistics;
1 records
null_country null_disease
0 0
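The same FILTER pattern extends to any column; expressing the result as a percentage makes completeness easier to compare across variables. A sketch using two numeric columns from the preview above:

```sql
-- Share of NULLs per column, as a percentage of all rows
SELECT
  ROUND(100.0 * COUNT(*) FILTER (WHERE prevalence IS NULL)    / COUNT(*), 2) AS pct_null_prevalence,
  ROUND(100.0 * COUNT(*) FILTER (WHERE mortality_pct IS NULL) / COUNT(*), 2) AS pct_null_mortality
FROM public.global_health_statistics;
```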

Creating a Basic Data Dictionary

The information_schema.columns view describes every column of a table. The query below targets a table named patients, which does not exist in this demo database, so it returns 0 records; substitute 'global_health_statistics' to document the dataset used here.

-- PostgreSQL example: auto-generate a simple data dictionary
SELECT 
    table_name,
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'patients';
0 records
table_name column_name data_type is_nullable

📊 Summary Table: SQL Equivalents

The table below provides SQL equivalents for common R exploration tasks.
It shows the task, the general SQL approach, and an example query.

Task SQL Approach Example
View structure of dataset Query metadata SELECT column_name, data_type FROM information_schema.columns
Summarize statistics Aggregate functions MIN(), MAX(), AVG(), STDDEV()
Preview first rows LIMIT SELECT * FROM patients LIMIT 10;
Preview last rows ORDER BY ... DESC LIMIT n SELECT * FROM patients ORDER BY id DESC LIMIT 8;
Count frequencies GROUP BY + COUNT() SELECT gender, COUNT(*) FROM patients GROUP BY gender;
Count rows COUNT(*) SELECT COUNT(*) FROM patients;
Count columns Metadata query SELECT COUNT(*) FROM information_schema.columns WHERE table_name = 'patients';
List variable names Metadata query SELECT column_name FROM information_schema.columns;
Check variable type Metadata query SELECT column_name, data_type FROM information_schema.columns;
Create data dictionary Metadata query (see metadata query example above)
Clean column/text values SQL functions UPPER(TRIM(col))
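To illustrate the last row of the table, a minimal cleaning sketch on the demo dataset: trim stray whitespace and normalize case before deduplicating values.

```sql
-- Normalize country names before counting distinct values
SELECT DISTINCT UPPER(TRIM(country)) AS country_clean
FROM public.global_health_statistics
ORDER BY country_clean;
```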

Part IV — Wrangling Core (SELECT / WHERE / CASE / JOIN)

Once a dataset has been imported and explored, the next step is wrangling — reshaping and preparing the data for analysis.
In SQL, the core tools for wrangling are:

  • Selecting & aliasing (choosing columns, renaming for clarity, creating computed columns)
  • Filtering, sorting, limiting (subsetting rows)
  • Conditional logic (CASE, COALESCE, NULLIF)
  • Joins (combining data across multiple tables)

📊 Summary Table — Conditional Logic & Joins

The table below summarizes key SQL wrangling patterns using the global_health_statistics dataset.
It highlights creating variables with CASE, handling missing values, and common join strategies, with representative examples.

Concept Description Example SQL
Binary indicator with CASE Recode into two groups (like ifelse() in R). SELECT disease_name, mortality_pct, CASE WHEN mortality_pct > 20 THEN 'High Mortality' ELSE 'Low/Moderate Mortality' END AS mortality_group FROM public.global_health_statistics LIMIT 10;
Multi-category recoding Multiple conditions, like case_when() in R. SELECT disease_name, mortality_pct, CASE WHEN mortality_pct < 5 THEN 'Low' WHEN mortality_pct BETWEEN 5 AND 20 THEN 'Moderate' WHEN mortality_pct > 20 THEN 'High' ELSE 'Unknown' END AS mortality_category FROM public.global_health_statistics LIMIT 10;
Buckets based on numeric ranges Group continuous values into buckets. SELECT country, disease_name, year_event, population_affected, CASE WHEN population_affected IS NULL THEN 'unknown' WHEN population_affected < 1_000 THEN 'very low' WHEN population_affected < 10_000 THEN 'low' WHEN population_affected < 100_000 THEN 'moderate' WHEN population_affected < 1_000_000 THEN 'high' ELSE 'very high' END AS burden_bucket FROM public.global_health_statistics ORDER BY year_event DESC, population_affected DESC LIMIT 10;
Combine multiple variables Conditional logic across >1 column. SELECT disease_name, age_group, CASE WHEN disease_category = 'Infectious' AND age_group = '0-14' THEN 'Child Infectious' WHEN disease_category = 'Infectious' AND age_group = '65+' THEN 'Elderly Infectious' WHEN disease_category = 'Non-communicable' AND age_group = '15-64' THEN 'Adult NCD' ELSE 'Other' END AS combined_category FROM public.global_health_statistics LIMIT 10;
Handle missing values COALESCE replaces NULL, NULLIF avoids divide-by-zero. SELECT country, COALESCE(healthcare_access,'Unknown') AS healthcare_access FROM public.global_health_statistics LIMIT 10;
LEFT JOIN Keep all fact rows, enrich where possible. SELECT g.country, COALESCE(cr.region,'Unknown') AS region, g.disease_name, g.year_event, g.population_affected FROM public.global_health_statistics g LEFT JOIN country_region cr ON cr.country = g.country WHERE g.year_event BETWEEN 2018 AND 2022 ORDER BY g.year_event DESC, region, g.population_affected DESC LIMIT 15;
FULL JOIN Keep all rows from both tables; useful for reconciliation. SELECT COALESCE(g.country, cr.country) AS country_key, cr.region, g.disease_name, g.year_event, g.population_affected, CASE WHEN g.country IS NULL THEN 'missing_in_facts' WHEN cr.country IS NULL THEN 'missing_in_dim' ELSE 'matched' END AS join_status FROM public.global_health_statistics g FULL OUTER JOIN country_region cr ON cr.country = g.country LIMIT 10;
Anti-join Find fact rows with no matching dimension. SELECT g.* FROM public.global_health_statistics g WHERE NOT EXISTS (SELECT 1 FROM country_region cr WHERE cr.country = g.country) LIMIT 10;
Semi-join Keep fact rows that have a match, without returning dimension columns. SELECT g.country, g.disease_name, g.year_event, g.population_affected FROM public.global_health_statistics g WHERE EXISTS (SELECT 1 FROM country_region cr WHERE cr.country = g.country) LIMIT 10;

For demonstration, I will continue using the global_health_statistics dataset.


1. Selecting & Aliasing

Aliasing allows us to rename columns or create new computed fields directly in queries.

Here, I renamed columns (country → country_name) and created a computed column to express affected population in millions. This makes query results easier to interpret when preparing reports.

-- Select key columns with aliases and a computed column
SELECT 
    country AS country_name,
    disease_name AS disease,
    year_event AS year,
    population_affected,
    ROUND(CAST(population_affected AS numeric) / 1000000, 2) AS population_millions
FROM public.global_health_statistics
LIMIT 10;
Displaying records 1 - 10
country_name disease year population_affected population_millions
Saudi Arabia Parkinson’s Disease 2005 994863 0.99
Australia Measles 2003 967319 0.97
Italy Ebola 2003 983403 0.98
India Ebola 2016 988479 0.99
Turkey Hepatitis 2001 995574 1.00
Australia Asthma 2007 997674 1.00
Saudi Arabia Zika 2021 974805 0.97
USA Cholera 2020 997379 1.00
Saudi Arabia HIV/AIDS 2012 984721 0.98
UK COVID-19 2023 999050 1.00

2. Filtering, Sorting, Limiting

Filtering rows focuses on subsets of interest. Sorting orders results, and limiting restricts the number of records returned.

The code below extracts the countries most affected by Tuberculosis after 2015, ranked by population impacted. Filtering with WHERE, ordering with ORDER BY, and subsetting with LIMIT form the core of SQL wrangling.

-- Top 10 most affected records for Tuberculosis after 2015
SELECT 
    country,
    year_event,
    disease_name,
    population_affected
FROM public.global_health_statistics
WHERE disease_name = 'Tuberculosis'
  AND year_event > 2015
ORDER BY population_affected DESC
LIMIT 10;
Displaying records 1 - 10
country year_event disease_name population_affected
USA 2021 Tuberculosis 999879
South Korea 2022 Tuberculosis 999833
UK 2019 Tuberculosis 999818
South Africa 2018 Tuberculosis 999784
Indonesia 2018 Tuberculosis 999769
South Korea 2019 Tuberculosis 999739
South Korea 2024 Tuberculosis 999711
India 2016 Tuberculosis 999603
India 2023 Tuberculosis 999487
Italy 2022 Tuberculosis 999477

While iterating on a query, prefer explicit predicates, deterministic ordering, and tight limits.

-- explicit predicates, deterministic ordering, and tight limits while iterating
SELECT country, disease_name, year_event, population_affected
FROM public.global_health_statistics
WHERE year_event BETWEEN 2010 AND 2020             -- time window
  AND disease_name ILIKE '%influenza%'             -- case-insensitive contains
  AND population_affected IS NOT NULL              -- avoid surprises in math
ORDER BY year_event DESC, population_affected DESC -- deterministic, business-first order
LIMIT 10;
Displaying records 1 - 10
country disease_name year_event population_affected
Brazil Influenza 2020 999885
UK Influenza 2020 999365
South Korea Influenza 2020 999300
China Influenza 2020 998855
Nigeria Influenza 2020 996164
France Influenza 2020 995307
Italy Influenza 2020 994765
Indonesia Influenza 2020 994646
Germany Influenza 2020 994263
Australia Influenza 2020 992359

Filtering with LIKE and ILIKE

Filtering text values is a common requirement in health datasets.
Often we need to retrieve records where a disease name, country, or other text field contains, starts with, or ends with a certain word.

In SQL, this is achieved with the LIKE operator; PostgreSQL's ILIKE variant makes the match case-insensitive.
- % matches any sequence of characters (including none).
- _ matches exactly one character.

Below is a summary table of common filtering patterns:

Pattern Description SQL Example
%Alzheimer% Matches any value containing “Alzheimer” SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '%Alzheimer%';
Alzheimer% Matches values that start with “Alzheimer” SELECT * FROM public.global_health_statistics WHERE disease_name LIKE 'Alzheimer%';
%Alzheimer Matches values that end with “Alzheimer” SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '%Alzheimer';
%flu% Matches any value containing “flu” (e.g. “Influenza”, “Avian flu”) SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '%flu%';
_alaria Matches a 7-letter word ending in “alaria” (e.g. “Malaria”) SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '_alaria';
C% Matches any value starting with the letter “C” SELECT * FROM public.global_health_statistics WHERE disease_name LIKE 'C%';
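PostgreSQL also supports POSIX regular expressions via the ~ (case-sensitive) and ~* (case-insensitive) operators, which help when a pattern outgrows the %/_ wildcards. A sketch:

```sql
-- Diseases starting with "Influenza" or "Malaria", in any letter case
SELECT DISTINCT disease_name
FROM public.global_health_statistics
WHERE disease_name ~* '^(influenza|malaria)';
```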

Find all records where the disease name starts with “Influenza”:

SELECT *
FROM public.global_health_statistics
WHERE disease_name LIKE 'Influenza%';
Displaying records 1 - 10
country year_event disease_name disease_category prevalence incidence mortality_pct age_group gender population_affected healthcare_access dr_per_1000 hospital_beds_per_1000 treatment_type average_treatment_cost_usd availability_of_vaccines_treatment recovery_rate_pct dalys improvement_in_5_years_pct per_capita_income_usd education_index urbanization_rate_pct
South Korea 2021 Influenza Autoimmune 16.40 9.03 8.06 0-18 Female 991918 96.11 2.54 7.47 Medication 24235 Yes 84.39 1895 5.32 83332 0.62 41.45
Russia 2024 Influenza Neurological 9.23 11.75 7.85 36-60 Male 995982 93.26 4.22 6.31 Vaccination 8893 No 79.07 2119 2.34 93844 0.66 74.08
Mexico 2007 Influenza Autoimmune 14.52 14.32 9.96 0-18 Female 985023 89.24 3.35 3.56 Surgery 17797 Yes 67.76 1859 3.29 45837 0.63 31.86
Nigeria 2016 Influenza Neurological 4.46 0.88 4.07 61+ Male 971438 99.91 1.35 5.23 Medication 16069 Yes 54.47 4392 3.98 42556 0.83 70.52
Russia 2005 Influenza Chronic 11.17 5.11 4.62 61+ Other 978667 63.33 3.12 0.79 Vaccination 39282 Yes 50.16 1032 0.71 19698 0.78 37.30
Saudi Arabia 2016 Influenza Respiratory 15.43 0.67 3.53 36-60 Female 993985 61.94 4.18 3.19 Vaccination 20822 No 68.19 3037 6.43 40410 0.65 75.44
Australia 2005 Influenza Autoimmune 14.17 6.03 3.00 36-60 Other 998326 67.87 4.30 8.71 Medication 41420 Yes 73.73 3639 9.46 31151 0.52 56.19
Germany 2008 Influenza Genetic 19.75 0.87 1.31 19-35 Other 976737 52.15 1.74 0.63 Medication 27479 No 84.86 2169 9.38 21133 0.89 80.84
France 2014 Influenza Infectious 5.38 13.02 4.03 61+ Female 966348 52.44 4.15 9.89 Surgery 18901 Yes 75.84 990 0.95 81729 0.65 39.77
Brazil 2013 Influenza Metabolic 13.00 3.53 8.28 36-60 Female 963595 61.27 2.40 4.45 Vaccination 36551 No 82.40 2619 1.95 47881 0.63 74.25

Use ILIKE instead of LIKE to make the search case-insensitive:

-- Case-insensitive match for "flu"
SELECT *
FROM public.global_health_statistics
WHERE disease_name ILIKE '%flu%';
Displaying records 1 - 10
country year_event disease_name disease_category prevalence incidence mortality_pct age_group gender population_affected healthcare_access dr_per_1000 hospital_beds_per_1000 treatment_type average_treatment_cost_usd availability_of_vaccines_treatment recovery_rate_pct dalys improvement_in_5_years_pct per_capita_income_usd education_index urbanization_rate_pct
France 2000 Influenza Neurological 10.98 12.50 6.16 19-35 Male 972289 90.77 0.81 8.97 Therapy 20977 Yes 68.52 240 7.06 50477 0.52 54.33
Saudi Arabia 2012 Influenza Respiratory 1.42 3.61 6.36 61+ Other 996854 64.29 2.52 7.32 Therapy 27310 Yes 88.74 4757 0.45 18559 0.87 20.21
UK 2004 Influenza Cardiovascular 5.65 7.87 0.45 61+ Male 982164 50.52 4.30 6.90 Vaccination 357 Yes 50.72 2576 8.51 96145 0.72 45.57
South Africa 2003 Influenza Genetic 18.19 0.77 1.28 19-35 Male 984296 78.75 4.08 3.62 Vaccination 36018 Yes 94.75 1958 4.49 24740 0.69 45.43
Saudi Arabia 2011 Influenza Neurological 18.44 7.17 8.03 36-60 Female 991688 75.60 4.46 2.18 Therapy 5946 No 63.11 2149 7.49 60796 0.51 49.90
Turkey 2019 Influenza Bacterial 17.33 1.57 8.76 0-18 Female 990389 69.89 4.80 9.46 Therapy 5303 Yes 62.03 709 1.20 54889 0.49 29.12
Italy 2014 Influenza Genetic 0.58 11.24 2.86 61+ Male 971317 91.63 2.32 8.15 Vaccination 48912 Yes 85.41 3653 8.92 53230 0.49 40.62
Germany 2024 Influenza Chronic 15.96 2.96 4.53 0-18 Female 994346 64.58 4.77 7.13 Vaccination 43370 No 54.10 4054 6.88 72122 0.47 26.04
Australia 2016 Influenza Parasitic 1.23 0.33 7.45 36-60 Other 996915 55.93 2.55 2.72 Therapy 42703 No 60.02 336 4.81 28099 0.66 85.06
China 2008 Influenza Parasitic 12.37 2.22 4.60 19-35 Male 992246 58.01 4.08 4.98 Vaccination 4833 No 98.34 117 2.56 53237 0.59 78.22
-- Find all records where the disease name contains "Alzheimer"
SELECT *
FROM public.global_health_statistics
WHERE disease_name LIKE '%Alzheimer%';
Displaying records 1 - 10
country year_event disease_name disease_category prevalence incidence mortality_pct age_group gender population_affected healthcare_access dr_per_1000 hospital_beds_per_1000 treatment_type average_treatment_cost_usd availability_of_vaccines_treatment recovery_rate_pct dalys improvement_in_5_years_pct per_capita_income_usd education_index urbanization_rate_pct
India 2022 Alzheimer’s Disease Neurological 1.39 6.56 9.98 19-35 Female 993771 89.40 3.56 4.85 Therapy 34353 No 65.82 3919 3.99 46885 0.63 33.94
Nigeria 2002 Alzheimer’s Disease Viral 15.92 12.36 6.57 19-35 Male 982502 76.19 2.47 6.80 Surgery 25648 Yes 98.90 1757 8.64 89792 0.79 46.24
Turkey 2017 Alzheimer’s Disease Respiratory 15.07 14.96 8.67 0-18 Female 993217 79.87 2.17 2.43 Therapy 23187 No 71.03 2156 8.09 28630 0.56 83.17
Germany 2023 Alzheimer’s Disease Viral 17.64 3.77 5.58 19-35 Other 982340 81.97 2.35 8.19 Therapy 45121 No 65.34 1861 3.76 74511 0.65 62.16
Japan 2020 Alzheimer’s Disease Autoimmune 7.30 6.18 4.46 36-60 Female 994553 53.65 2.31 4.98 Surgery 36987 No 79.00 2286 2.89 80599 0.71 34.09
Japan 2015 Alzheimer’s Disease Viral 14.21 8.87 8.83 61+ Other 987549 87.25 2.45 7.87 Vaccination 2382 No 95.62 1509 3.66 41121 0.82 68.78
Saudi Arabia 2005 Alzheimer’s Disease Genetic 18.55 11.63 7.36 19-35 Other 989098 81.79 2.71 6.32 Surgery 38087 Yes 81.21 420 6.04 67373 0.74 32.60
USA 2012 Alzheimer’s Disease Parasitic 15.02 0.81 6.39 0-18 Male 978985 82.86 3.58 2.62 Therapy 5799 Yes 71.14 1636 9.24 14485 0.63 33.76
Brazil 2014 Alzheimer’s Disease Bacterial 9.72 0.82 4.23 19-35 Other 999386 99.29 3.65 1.96 Vaccination 32450 No 72.86 2464 4.40 18578 0.51 57.38
Indonesia 2011 Alzheimer’s Disease Respiratory 14.59 1.88 1.37 0-18 Male 995566 63.94 3.33 5.84 Surgery 17167 Yes 70.50 1986 6.51 27794 0.89 73.16

Top-N per group (practical)

Top 3 countries per year by affected population.

WITH ranked AS (
  SELECT
    country,
    year_event,
    population_affected,
    ROW_NUMBER() OVER (
      PARTITION BY year_event
      ORDER BY population_affected DESC, country ASC
    ) AS rn
  FROM public.global_health_statistics
  WHERE population_affected IS NOT NULL
)
SELECT country, year_event, population_affected
FROM ranked
WHERE rn <= 3
ORDER BY year_event, rn, country;
Displaying records 1 - 10
country year_event population_affected
Australia 2000 999987
Brazil 2000 999982
Canada 2000 999970
China 2001 999989
Italy 2001 999986
Russia 2001 999979
UK 2002 999999
France 2002 999983
China 2002 999963
Japan 2003 999999
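For the special case of top-1 per group, PostgreSQL's DISTINCT ON is a compact alternative to the window-function CTE above:

```sql
-- One row per year: the single most affected record
SELECT DISTINCT ON (year_event)
       year_event, country, population_affected
FROM public.global_health_statistics
ORDER BY year_event, population_affected DESC;
```

DISTINCT ON keeps the first row per year_event according to the ORDER BY, so the descending sort on population_affected selects each year's maximum.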

3. Creating Variables and Conditional Logic

SQL lets us recode or transform values inline.

Using CASE

CASE is the SQL equivalent of ifelse() in R or if-else in Python, and it’s essential for creating categorical groupings.

-- Create a binary indicator for high mortality diseases
SELECT 
    disease_name,
    mortality_pct,
    CASE 
        WHEN mortality_pct > 20 THEN 'High Mortality'
        ELSE 'Low/Moderate Mortality'
    END AS mortality_group
FROM public.global_health_statistics
LIMIT 10;
Displaying records 1 - 10
disease_name mortality_pct mortality_group
Cancer 4.58 Low/Moderate Mortality
Hepatitis 3.00 Low/Moderate Mortality
Tuberculosis 7.15 Low/Moderate Mortality
HIV/AIDS 6.04 Low/Moderate Mortality
Hypertension 3.42 Low/Moderate Mortality
HIV/AIDS 1.83 Low/Moderate Mortality
Parkinson’s Disease 1.04 Low/Moderate Mortality
Diabetes 7.10 Low/Moderate Mortality
Cancer 0.55 Low/Moderate Mortality
Cholera 6.68 Low/Moderate Mortality
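To see how the recode distributes across the dataset, rather than labeling rows one by one, wrap the same CASE in a GROUP BY:

```sql
-- Count rows per mortality group
SELECT
    CASE
        WHEN mortality_pct > 20 THEN 'High Mortality'
        ELSE 'Low/Moderate Mortality'
    END AS mortality_group,
    COUNT(*) AS n_rows
FROM public.global_health_statistics
GROUP BY 1
ORDER BY n_rows DESC;
```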

Multi-Category Recoding

A CASE statement with multiple WHEN ... THEN ... clauses.

Suppose we want to categorize diseases into Low, Moderate, and High mortality risk groups based on the mortality_pct variable.

-- Multi-category classification of mortality levels
SELECT 
    disease_name,
    mortality_pct,
    CASE 
        WHEN mortality_pct < 5 THEN 'Low'
        WHEN mortality_pct BETWEEN 5 AND 20 THEN 'Moderate'
        WHEN mortality_pct > 20 THEN 'High'
        ELSE 'Unknown'
    END AS mortality_category
FROM public.global_health_statistics
LIMIT 10;
Displaying records 1 - 10
disease_name mortality_pct mortality_category
Zika 3.08 Low
Cholera 2.69 Low
HIV/AIDS 0.22 Low
Asthma 8.27 Moderate
Alzheimer’s Disease 3.80 Low
COVID-19 0.89 Low
Tuberculosis 3.45 Low
HIV/AIDS 4.25 Low
Leprosy 4.55 Low
Diabetes 1.42 Low
-- Bucket affected population; underscore digit separators (e.g. 1_000) require PostgreSQL 16+
SELECT
  country,
  disease_name,
  year_event,
  population_affected,
  CASE
    WHEN population_affected IS NULL               THEN 'unknown'
    WHEN population_affected <  1_000              THEN 'very low'
    WHEN population_affected < 10_000              THEN 'low'
    WHEN population_affected < 100_000             THEN 'moderate'
    WHEN population_affected < 1_000_000           THEN 'high'
    ELSE                                                'very high'
  END AS burden_bucket
FROM public.global_health_statistics
ORDER BY year_event DESC, population_affected DESC
LIMIT 10;
Displaying records 1 - 10
country disease_name year_event population_affected burden_bucket
Argentina Measles 2024 999962 high
South Korea Polio 2024 999921 high
Turkey Influenza 2024 999910 high
India Dengue 2024 999855 high
India Polio 2024 999848 high
South Korea Asthma 2024 999812 high
Russia Polio 2024 999791 high
South Africa Cancer 2024 999754 high
Italy Diabetes 2024 999751 high
South Korea Tuberculosis 2024 999711 high

Combining Multiple Variables

We can even combine conditions from more than one variable. Here, we create a category based on both disease type and age group. Note that the age groups tested below ('0-14', '65+', '15-64') do not occur in this dataset, whose groups are 0-18, 19-35, 36-60, and 61+, so every row falls into 'Other'; adjust the WHEN clauses to match your actual coding.

-- Conditional grouping using disease and age group
SELECT 
    disease_name,
    age_group,
    CASE 
        WHEN disease_category = 'Infectious' AND age_group = '0-14' THEN 'Child Infectious'
        WHEN disease_category = 'Infectious' AND age_group = '65+' THEN 'Elderly Infectious'
        WHEN disease_category = 'Non-communicable' AND age_group = '15-64' THEN 'Adult NCD'
        ELSE 'Other'
    END AS combined_category
FROM public.global_health_statistics
LIMIT 10;
Displaying records 1 - 10
disease_name age_group combined_category
Zika 36-60 Other
Malaria 36-60 Other
Measles 0-18 Other
Leprosy 36-60 Other
Malaria 0-18 Other
Influenza 61+ Other
Diabetes 36-60 Other
Polio 0-18 Other
Zika 36-60 Other
Leprosy 61+ Other

Handling Missing Values (COALESCE, NULLIF)

COALESCE replaces missing values with a default. NULLIF prevents division by zero by returning NULL instead.

-- Replace null healthcare access values with 'Unknown'
SELECT
    country,
    disease_name,
    year_event,
    COALESCE(healthcare_access::text, 'Unknown') AS healthcare_access
FROM public.global_health_statistics
LIMIT 10;
Displaying records 1 - 10
country disease_name year_event healthcare_access
Australia Polio 2020 60.15
Italy Zika 2004 86.37
Germany Leprosy 2021 82.38
Nigeria Dengue 2005 72.18
Germany Influenza 2012 64.23
Argentina Rabies 2014 68.28
UK Zika 2007 70.82
South Korea Alzheimer’s Disease 2003 78.57
Germany Zika 2016 97.23
Mexico Malaria 2020 84.61
-- Calculate incidence rate safely (avoid division by zero)
SELECT 
    country,
    disease_name,
    incidence / NULLIF(population_affected, 0) AS incidence_rate
FROM public.global_health_statistics
WHERE year_event = 2020
LIMIT 10;
Displaying records 1 - 10
country disease_name incidence_rate
Italy Ebola 9.7e-06
Turkey COVID-19 2.9e-06
Italy Leprosy 1.3e-06
South Korea Leprosy 4.6e-06
Argentina Hepatitis 6.9e-06
Mexico Measles 7.3e-06
Mexico COVID-19 8.0e-06
France Leprosy 3.9e-06
South Korea Ebola 1.2e-06
Turkey Measles 3.0e-07
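The two functions combine naturally: NULLIF guards the division, and COALESCE supplies an explicit default when the result is NULL. A sketch:

```sql
-- Safe rate with an explicit fallback of 0
SELECT
    country,
    disease_name,
    COALESCE(incidence / NULLIF(population_affected, 0), 0) AS incidence_rate_safe
FROM public.global_health_statistics
WHERE year_event = 2020
LIMIT 10;
```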

4. Joins (INNER / LEFT / FULL)

To keep this self-contained, I’ll create two tables to demonstrate common join shapes:

country_region maps countries to regions.

disease_group maps diseases to broader categories

In a real project these would live in dim.* tables with clear stewardship. Note one wrinkle: the fact table spells the country 'USA', while the dimension below uses 'United States', so those rows will not match in the joins that follow.

CREATE TEMP TABLE country_region (
  country TEXT PRIMARY KEY,
  region  TEXT NOT NULL
);
INSERT INTO country_region(country, region) VALUES
  ('Australia','Oceania'),
  ('United States','North America'),
  ('Brazil','South America'),
  ('India','Asia'),
  ('Kenya','Africa');
CREATE TEMP TABLE disease_group (
  disease_name TEXT PRIMARY KEY,
  disease_group TEXT NOT NULL
);
INSERT INTO disease_group(disease_name, disease_group) VALUES
  ('Influenza A','Respiratory'),
  ('Influenza B','Respiratory'),
  ('COVID-19','Respiratory'),
  ('Dengue','Vector-borne'),
  ('Malaria','Vector-borne');
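The section title also lists INNER JOIN, which keeps only rows present on both sides. A sketch against the temp table just created, returning only facts for countries that appear in country_region:

```sql
-- INNER JOIN drops facts without a matching dimension row
SELECT g.country, cr.region, g.disease_name, g.year_event
FROM public.global_health_statistics AS g
INNER JOIN country_region AS cr
  ON cr.country = g.country
LIMIT 10;
```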

LEFT JOIN (keep all facts; enrich where possible)

Prefer this when facts must be retained even if the dimension is incomplete.

In the example below, LEFT JOIN ensures all health statistics are retained, even if metadata is missing.

SELECT
  g.country,
  COALESCE(cr.region, 'Unknown') AS region,
  g.disease_name,
  g.year_event,
  g.population_affected
FROM public.global_health_statistics AS g
LEFT JOIN country_region           AS cr
  ON cr.country = g.country
WHERE g.year_event BETWEEN 2018 AND 2022
ORDER BY g.year_event DESC, region, g.population_affected DESC
LIMIT 15;
Displaying records 1 - 10
country region disease_name year_event population_affected
India Asia Influenza 2022 998735
India Asia COVID-19 2022 998728
India Asia Cholera 2022 998331
India Asia Polio 2022 998188
India Asia Measles 2022 998057
India Asia Cancer 2022 996565
India Asia Hypertension 2022 994593
India Asia Diabetes 2022 994007
India Asia Alzheimer’s Disease 2022 993771
India Asia Zika 2022 993585

FULL JOIN: keep everything from both sides

Useful for reconciliation and data quality investigations.

SELECT
  COALESCE(g.country, cr.country) AS country_key,
  cr.region,
  g.disease_name,
  g.year_event,
  g.population_affected,
  CASE
    WHEN g.country IS NULL THEN 'missing_in_facts'
    WHEN cr.country IS NULL THEN 'missing_in_dim'
    ELSE 'matched'
  END AS join_status
FROM public.global_health_statistics AS g
FULL OUTER JOIN country_region      AS cr
  ON cr.country = g.country
ORDER BY join_status, country_key
LIMIT 10;
Displaying records 1 - 10
country_key region disease_name year_event population_affected join_status
Australia Oceania Hypertension 2004 995949 matched
Australia Oceania Hepatitis 2009 996791 matched
Australia Oceania COVID-19 2000 995669 matched
Australia Oceania Measles 2021 977797 matched
Australia Oceania Asthma 2010 982446 matched
Australia Oceania Malaria 2012 993091 matched
Australia Oceania Cholera 2009 993428 matched
Australia Oceania Asthma 2015 966733 matched
Australia Oceania Asthma 2013 998228 matched
Australia Oceania Leprosy 2011 990548 matched

Joining multiple dimensions

Add both region and a disease_group, then compute region-level totals.

SELECT
  cr.region,
  dg.disease_group,
  g.year_event,
  SUM(g.population_affected) AS population_total
FROM public.global_health_statistics AS g
LEFT JOIN country_region             AS cr
  ON cr.country = g.country
LEFT JOIN disease_group              AS dg
  ON dg.disease_name = g.disease_name
WHERE g.year_event >= 2015
GROUP BY cr.region, dg.disease_group, g.year_event
ORDER BY cr.region NULLS LAST, dg.disease_group NULLS LAST, g.year_event;
Displaying records 1 - 10
region disease_group year_event population_total
Asia Respiratory 2015 996752
Asia Respiratory 2016 988454
Asia Respiratory 2017 999807
Asia Respiratory 2018 997011
Asia Respiratory 2019 991913
Asia Respiratory 2020 997987
Asia Respiratory 2021 998308
Asia Respiratory 2022 998728
Asia Respiratory 2023 980264
Asia Respiratory 2024 994785

Anti-join pattern (find orphans)

Facts with unmapped countries (dimension coverage gap):

SELECT g.*
FROM public.global_health_statistics AS g
WHERE NOT EXISTS (
  SELECT 1
  FROM country_region AS cr
  WHERE cr.country = g.country
)
ORDER BY g.country, g.disease_name, g.year_event
LIMIT 10;
Displaying records 1 - 10
country year_event disease_name disease_category prevalence incidence mortality_pct age_group gender population_affected healthcare_access dr_per_1000 hospital_beds_per_1000 treatment_type average_treatment_cost_usd availability_of_vaccines_treatment recovery_rate_pct dalys improvement_in_5_years_pct per_capita_income_usd education_index urbanization_rate_pct
Argentina 2000 Alzheimer’s Disease Chronic 11.67 2.09 0.99 19-35 Male 999739 56.45 2.62 0.94 Vaccination 7431 Yes 93.08 2852 2.11 94376 0.45 89.55
Argentina 2001 Alzheimer’s Disease Genetic 7.53 1.16 1.60 61+ Female 995329 55.56 2.92 9.01 Therapy 22389 No 89.26 2828 9.99 32326 0.71 51.91
Argentina 2002 Alzheimer’s Disease Bacterial 19.84 2.36 7.98 36-60 Female 999312 80.47 1.66 7.84 Surgery 22607 Yes 52.38 4166 2.33 78110 0.82 71.93
Argentina 2003 Alzheimer’s Disease Respiratory 2.71 1.16 6.98 61+ Female 993597 69.29 4.80 1.09 Therapy 22557 No 90.07 3888 8.11 92623 0.83 62.30
Argentina 2004 Alzheimer’s Disease Viral 3.34 14.08 2.02 0-18 Male 991236 79.44 1.33 6.34 Surgery 18506 Yes 66.13 2646 5.26 70858 0.54 84.19
Argentina 2005 Alzheimer’s Disease Genetic 7.95 2.05 6.92 61+ Male 994878 55.81 2.91 3.04 Therapy 8444 Yes 61.27 1902 2.97 70939 0.55 68.29
Argentina 2006 Alzheimer’s Disease Cardiovascular 4.59 5.16 2.99 61+ Other 999475 63.33 4.97 3.40 Medication 14464 Yes 98.18 156 3.38 92196 0.55 33.07
Argentina 2007 Alzheimer’s Disease Neurological 16.36 1.39 3.74 61+ Other 968726 90.92 1.87 2.74 Medication 10705 No 59.49 1924 4.10 40606 0.78 69.94
Argentina 2008 Alzheimer’s Disease Neurological 9.55 4.24 5.27 0-18 Female 982297 68.95 2.43 1.69 Surgery 35053 Yes 82.13 3169 9.67 91423 0.71 78.84
Argentina 2009 Alzheimer’s Disease Autoimmune 11.05 3.43 7.97 36-60 Other 984953 57.13 4.31 5.23 Medication 28014 Yes 92.87 759 2.42 95653 0.75 51.54
Anti-join via LEFT JOIN … IS NULL (equivalent)

SELECT g.*
FROM public.global_health_statistics AS g
LEFT JOIN country_region AS cr
  ON cr.country = g.country
WHERE cr.country IS NULL
ORDER BY g.country, g.disease_name, g.year_event
LIMIT 20;
Displaying records 1 - 10
country year_event disease_name disease_category prevalence incidence mortality_pct age_group gender population_affected healthcare_access dr_per_1000 hospital_beds_per_1000 treatment_type average_treatment_cost_usd availability_of_vaccines_treatment recovery_rate_pct dalys improvement_in_5_years_pct per_capita_income_usd education_index urbanization_rate_pct
Argentina 2000 Alzheimer’s Disease Chronic 11.67 2.09 0.99 19-35 Male 999739 56.45 2.62 0.94 Vaccination 7431 Yes 93.08 2852 2.11 94376 0.45 89.55
Argentina 2001 Alzheimer’s Disease Genetic 7.53 1.16 1.60 61+ Female 995329 55.56 2.92 9.01 Therapy 22389 No 89.26 2828 9.99 32326 0.71 51.91
Argentina 2002 Alzheimer’s Disease Bacterial 19.84 2.36 7.98 36-60 Female 999312 80.47 1.66 7.84 Surgery 22607 Yes 52.38 4166 2.33 78110 0.82 71.93
Argentina 2003 Alzheimer’s Disease Respiratory 2.71 1.16 6.98 61+ Female 993597 69.29 4.80 1.09 Therapy 22557 No 90.07 3888 8.11 92623 0.83 62.30
Argentina 2004 Alzheimer’s Disease Viral 3.34 14.08 2.02 0-18 Male 991236 79.44 1.33 6.34 Surgery 18506 Yes 66.13 2646 5.26 70858 0.54 84.19
Argentina 2005 Alzheimer’s Disease Genetic 7.95 2.05 6.92 61+ Male 994878 55.81 2.91 3.04 Therapy 8444 Yes 61.27 1902 2.97 70939 0.55 68.29
Argentina 2006 Alzheimer’s Disease Cardiovascular 4.59 5.16 2.99 61+ Other 999475 63.33 4.97 3.40 Medication 14464 Yes 98.18 156 3.38 92196 0.55 33.07
Argentina 2007 Alzheimer’s Disease Neurological 16.36 1.39 3.74 61+ Other 968726 90.92 1.87 2.74 Medication 10705 No 59.49 1924 4.10 40606 0.78 69.94
Argentina 2008 Alzheimer’s Disease Neurological 9.55 4.24 5.27 0-18 Female 982297 68.95 2.43 1.69 Surgery 35053 Yes 82.13 3169 9.67 91423 0.71 78.84
Argentina 2009 Alzheimer’s Disease Autoimmune 11.05 3.43 7.97 36-60 Other 984953 57.13 4.31 5.23 Medication 28014 Yes 92.87 759 2.42 95653 0.75 51.54

Semi-Join Pattern (Keep Only Facts That Have a Match)

Goal: return fact rows that have at least one matching dimension row, without returning dimension columns or duplicating facts.
Canonical SQL: WHERE EXISTS (...) (preferred) or IN (subquery).
Unlike inner joins, a semi-join returns each qualifying fact at most once.

  1. Semi-join via EXISTS ✅ (recommended)
SELECT g.country, g.disease_name, g.year_event, g.population_affected
FROM public.global_health_statistics AS g
WHERE EXISTS (
  SELECT 1
  FROM country_region AS cr
  WHERE cr.country = g.country
)
ORDER BY g.country, g.disease_name, g.year_event
LIMIT 10;
Displaying records 1 - 10
country disease_name year_event population_affected
Australia Alzheimer’s Disease 2000 999987
Australia Alzheimer’s Disease 2001 999380
Australia Alzheimer’s Disease 2002 989422
Australia Alzheimer’s Disease 2003 995752
Australia Alzheimer’s Disease 2004 993018
Australia Alzheimer’s Disease 2005 994235
Australia Alzheimer’s Disease 2006 999998
Australia Alzheimer’s Disease 2007 985789
Australia Alzheimer’s Disease 2008 994427
Australia Alzheimer’s Disease 2009 981466
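
The IN (subquery) alternative mentioned above is equivalent here because country is a single, non-NULL join key. As an illustration only (not part of the original workflow), here is a minimal Python + SQLite sketch with hypothetical toy tables:

```python
import sqlite3

# Toy stand-ins for the fact and dimension tables (hypothetical rows,
# not the real dataset): two countries in the dimension, three in the facts.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE global_health_statistics (country TEXT, disease_name TEXT, year_event INT);
INSERT INTO global_health_statistics VALUES
  ('Australia','Malaria',2000), ('Argentina','Malaria',2000), ('Atlantis','Malaria',2000);
CREATE TABLE country_region (country TEXT, region TEXT);
INSERT INTO country_region VALUES ('Australia','Oceania'), ('Argentina','Americas');
""")

# Semi-join via IN: keeps only facts whose country appears in the dimension.
rows = con.execute("""
SELECT g.country
FROM global_health_statistics AS g
WHERE g.country IN (SELECT cr.country FROM country_region AS cr)
ORDER BY g.country;
""").fetchall()
print(rows)  # [('Argentina',), ('Australia',)] -- 'Atlantis' is filtered out
```

EXISTS remains the safer default: in particular, NOT IN returns no rows at all if the subquery yields any NULL, a trap EXISTS/NOT EXISTS avoids.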

3. Appending Data (Stacking Rows)

Appending (row-binding) stacks the result sets of two queries using UNION or UNION ALL. Both queries must return the same number of columns with compatible types.

UNION ALL keeps duplicates (and is faster, since no de-duplication pass is needed).

UNION removes duplicate rows from the combined result.

-- Append two subsets of the same table (stack rows)
SELECT country, year_event, disease_name, population_affected
FROM public.global_health_statistics
WHERE year_event <= 2015

UNION ALL

SELECT country, year_event, disease_name, population_affected
FROM public.global_health_statistics
WHERE year_event > 2015
ORDER BY year_event
LIMIT 20;
Displaying records 1 - 10
country year_event disease_name population_affected
India 2000 Hypertension 989212
Australia 2000 Rabies 999093
France 2000 COVID-19 996188
Canada 2000 HIV/AIDS 997838
Italy 2000 Tuberculosis 982734
Russia 2000 Ebola 994461
Brazil 2000 Ebola 998677
Russia 2000 Cancer 994540
Australia 2000 HIV/AIDS 998771
Saudi Arabia 2000 Dengue 981698
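
Because the two WHERE clauses above are disjoint (<= 2015 vs. > 2015), UNION and UNION ALL would return the same rows here; the difference only appears when the inputs overlap. A small Python + SQLite sketch with hypothetical data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (country TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("Brazil",), ("Brazil",), ("India",)])

# Stacking the table onto itself: the inputs fully overlap, so the
# UNION ALL result keeps every copy while UNION collapses duplicates.
union_all = con.execute("SELECT country FROM t UNION ALL SELECT country FROM t").fetchall()
union = con.execute("SELECT country FROM t UNION SELECT country FROM t ORDER BY country").fetchall()
print(len(union_all), union)  # 6 [('Brazil',), ('India',)]
```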

Part V — Advanced Wrangling

Beyond basic SELECT and JOIN operations, SQL offers powerful tools for advanced wrangling:

  • Common Table Expressions (CTEs) for readability and modular pipelines
  • Window functions for ranking, partitioning, and moving calculations
  • Deduplication using ROW_NUMBER()
  • Upserts for managing incremental updates

These techniques are essential for real-world analytics pipelines.


1. Common Table Expressions (CTEs)

A CTE lets us build queries in readable, stepwise blocks. It’s the SQL equivalent of creating intermediate objects in R with the pipe %>%.

In the example below, the first block (mortality_groups) is like an intermediate dataset. The second query reuses it for further filtering.

-- CTE to calculate mortality categories, then filter
WITH mortality_groups AS (
  SELECT 
      country,
      disease_name,
      year_event,
      mortality_pct,
      CASE 
        WHEN mortality_pct > 9 THEN 'High Mortality'
        ELSE 'Low/Moderate Mortality'
      END AS mortality_group
  FROM public.global_health_statistics
)
SELECT *
FROM mortality_groups
WHERE mortality_group = 'High Mortality'
LIMIT 10;
Displaying records 1 - 10
country disease_name year_event mortality_pct mortality_group
Nigeria Malaria 2002 9.97 High Mortality
South Korea Hepatitis 2024 9.48 High Mortality
Argentina HIV/AIDS 2007 9.65 High Mortality
Brazil Polio 2005 9.60 High Mortality
Australia Malaria 2012 9.50 High Mortality
India Influenza 2002 9.22 High Mortality
Canada Polio 2012 9.91 High Mortality
Italy Malaria 2020 9.24 High Mortality
Russia Malaria 2008 9.72 High Mortality
Nigeria Asthma 2013 9.15 High Mortality

2. Window Functions

Window functions allow calculations across partitions of data without collapsing rows. They are powerful for ranking, moving averages, and comparisons.

Ranking within Groups

-- Rank diseases by affected population per country in 2020
SELECT 
    country,
    disease_name,
    population_affected,
    RANK() OVER (PARTITION BY country ORDER BY population_affected DESC) AS rank_within_country
FROM public.global_health_statistics
WHERE year_event = 2020
ORDER BY country, rank_within_country
LIMIT 10;
Displaying records 1 - 10
country disease_name population_affected rank_within_country
Argentina Cancer 999522 1
Argentina Dengue 997702 2
Argentina COVID-19 997478 3
Argentina Hypertension 997396 4
Argentina Alzheimer’s Disease 995954 5
Argentina Tuberculosis 995534 6
Argentina Ebola 995350 7
Argentina HIV/AIDS 992933 8
Argentina Diabetes 992740 9
Argentina Zika 992414 10

Calculating a rolling 3-year average for selected diseases. Note that the ROWS frame counts rows, not calendar years: because this table holds several rows per disease-year (different countries, age groups, and genders), each average spans the previous two rows rather than the previous two years. For a true yearly trend, aggregate to one row per disease-year before applying the window.

-- 3-year moving average of incidence per disease
SELECT 
    disease_name,
    year_event,
    incidence,
    ROUND(
      (AVG(incidence) OVER (
          PARTITION BY disease_name 
          ORDER BY year_event 
          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
      ))::numeric, 2
    ) AS incidence_moving_avg
FROM public.global_health_statistics
WHERE disease_name IN ('Malaria','Tuberculosis')
ORDER BY disease_name, year_event
LIMIT 15;
Displaying records 1 - 10
disease_name year_event incidence incidence_moving_avg
Malaria 2000 14.50 14.50
Malaria 2000 12.62 13.56
Malaria 2000 8.22 11.78
Malaria 2000 5.83 8.89
Malaria 2000 3.87 5.97
Malaria 2000 12.26 7.32
Malaria 2000 6.30 7.48
Malaria 2000 4.53 7.70
Malaria 2000 4.33 5.05
Malaria 2000 1.99 3.62
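
The repeated year 2000 in the output shows the row-versus-year issue: the frame averages neighboring rows, not years. Collapsing to one row per year first gives a true 3-year average. A runnable Python + SQLite sketch with hypothetical incidence values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (disease_name TEXT, year_event INT, incidence REAL)")
# Hypothetical values: two rows per year, mimicking the multi-row-per-year data.
con.executemany("INSERT INTO ghs VALUES ('Malaria', ?, ?)",
                [(2000, 10.0), (2000, 14.0), (2001, 6.0), (2001, 10.0),
                 (2002, 4.0), (2002, 8.0)])

# Collapse to one row per year first, then take a genuine 3-year moving average.
rows = con.execute("""
WITH yearly AS (
  SELECT year_event, AVG(incidence) AS yearly_incidence
  FROM ghs
  GROUP BY year_event
)
SELECT year_event,
       ROUND(AVG(yearly_incidence) OVER (
         ORDER BY year_event ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2) AS ma3
FROM yearly
ORDER BY year_event;
""").fetchall()
print(rows)  # [(2000, 12.0), (2001, 10.0), (2002, 8.67)]
```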

3. Deduplication with ROW_NUMBER()

Sometimes datasets contain duplicate records. We can use ROW_NUMBER() to keep the “first” occurrence and drop the rest.

Identify duplicates across all columns

The example below will show you which rows are duplicated, and how many times.

For example, if the same country + disease + year + population row appears 3 times, you’ll see duplicate_count = 3.

SELECT 
    country,
    disease_name,
    year_event,
    population_affected,
    COUNT(*) AS duplicate_count
FROM public.global_health_statistics
GROUP BY 
    country,
    disease_name,
    year_event,
    population_affected
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC
LIMIT 10;
0 records
country disease_name year_event population_affected duplicate_count

If you want to remove duplicates, keeping only one record per case

-- This removes duplicates across all columns.
SELECT DISTINCT *
FROM public.global_health_statistics
LIMIT 15;
Displaying records 1 - 10
country year_event disease_name disease_category prevalence incidence mortality_pct age_group gender population_affected healthcare_access dr_per_1000 hospital_beds_per_1000 treatment_type average_treatment_cost_usd availability_of_vaccines_treatment recovery_rate_pct dalys improvement_in_5_years_pct per_capita_income_usd education_index urbanization_rate_pct
Argentina 2000 Alzheimer’s Disease Chronic 11.67 2.09 0.99 19-35 Male 999739 56.45 2.62 0.94 Vaccination 7431 Yes 93.08 2852 2.11 94376 0.45 89.55
Argentina 2000 Asthma Parasitic 6.03 11.82 4.71 36-60 Male 999306 65.53 3.85 2.67 Vaccination 34713 Yes 87.90 2810 0.42 83780 0.76 22.01
Argentina 2000 Cancer Bacterial 19.72 7.65 4.07 61+ Other 980822 57.52 3.96 5.92 Medication 36545 Yes 56.26 726 1.43 86501 0.61 70.75
Argentina 2000 Cholera Infectious 19.31 13.36 6.39 36-60 Other 991759 64.20 0.87 7.23 Therapy 5415 No 51.64 4251 7.54 95337 0.81 83.87
Argentina 2000 COVID-19 Genetic 5.87 4.43 3.91 36-60 Female 996726 74.90 4.33 0.79 Surgery 48595 No 84.81 494 0.13 64551 0.41 46.80
Argentina 2000 Dengue Chronic 6.42 13.52 4.60 61+ Female 997142 69.02 1.13 4.69 Therapy 15127 Yes 53.30 4906 5.08 98961 0.51 40.18
Argentina 2000 Diabetes Viral 9.74 4.52 1.08 0-18 Other 996077 60.79 3.10 2.41 Medication 43795 No 56.47 4572 4.81 75035 0.80 68.70
Argentina 2000 Ebola Neurological 11.57 2.21 7.15 61+ Female 996462 68.50 1.44 9.78 Medication 33639 No 98.32 4943 8.77 21021 0.63 74.10
Argentina 2000 Hepatitis Viral 6.85 6.60 4.23 61+ Other 975406 83.04 3.48 3.67 Vaccination 17061 No 73.33 1911 3.90 37765 0.71 48.95
Argentina 2000 HIV/AIDS Autoimmune 7.45 4.96 0.69 36-60 Female 996694 76.82 3.45 3.63 Medication 34932 Yes 78.58 3455 6.66 40875 0.77 89.73

If you only want to de-duplicate on some columns (e.g., country, disease_name, year_event), then use DISTINCT ON (Postgres-only):

SELECT DISTINCT ON (country, disease_name, year_event) *
FROM public.global_health_statistics
ORDER BY country, disease_name, year_event, population_affected DESC
LIMIT 15;
Displaying records 1 - 10
country year_event disease_name disease_category prevalence incidence mortality_pct age_group gender population_affected healthcare_access dr_per_1000 hospital_beds_per_1000 treatment_type average_treatment_cost_usd availability_of_vaccines_treatment recovery_rate_pct dalys improvement_in_5_years_pct per_capita_income_usd education_index urbanization_rate_pct
Argentina 2000 Alzheimer’s Disease Chronic 11.67 2.09 0.99 19-35 Male 999739 56.45 2.62 0.94 Vaccination 7431 Yes 93.08 2852 2.11 94376 0.45 89.55
Argentina 2001 Alzheimer’s Disease Genetic 7.53 1.16 1.60 61+ Female 995329 55.56 2.92 9.01 Therapy 22389 No 89.26 2828 9.99 32326 0.71 51.91
Argentina 2002 Alzheimer’s Disease Bacterial 19.84 2.36 7.98 36-60 Female 999312 80.47 1.66 7.84 Surgery 22607 Yes 52.38 4166 2.33 78110 0.82 71.93
Argentina 2003 Alzheimer’s Disease Respiratory 2.71 1.16 6.98 61+ Female 993597 69.29 4.80 1.09 Therapy 22557 No 90.07 3888 8.11 92623 0.83 62.30
Argentina 2004 Alzheimer’s Disease Viral 3.34 14.08 2.02 0-18 Male 991236 79.44 1.33 6.34 Surgery 18506 Yes 66.13 2646 5.26 70858 0.54 84.19
Argentina 2005 Alzheimer’s Disease Genetic 7.95 2.05 6.92 61+ Male 994878 55.81 2.91 3.04 Therapy 8444 Yes 61.27 1902 2.97 70939 0.55 68.29
Argentina 2006 Alzheimer’s Disease Cardiovascular 4.59 5.16 2.99 61+ Other 999475 63.33 4.97 3.40 Medication 14464 Yes 98.18 156 3.38 92196 0.55 33.07
Argentina 2007 Alzheimer’s Disease Neurological 16.36 1.39 3.74 61+ Other 968726 90.92 1.87 2.74 Medication 10705 No 59.49 1924 4.10 40606 0.78 69.94
Argentina 2008 Alzheimer’s Disease Neurological 9.55 4.24 5.27 0-18 Female 982297 68.95 2.43 1.69 Surgery 35053 Yes 82.13 3169 9.67 91423 0.71 78.84
Argentina 2009 Alzheimer’s Disease Autoimmune 11.05 3.43 7.97 36-60 Other 984953 57.13 4.31 5.23 Medication 28014 Yes 92.87 759 2.42 95653 0.75 51.54

Show the actual duplicate rows (not just counts)

This way you get the full duplicate rows back, along with a duplicate_count column.

SELECT *
FROM (
    SELECT *,
           COUNT(*) OVER (
               PARTITION BY country, disease_name, year_event, population_affected
           ) AS duplicate_count
    FROM public.global_health_statistics
) t
WHERE duplicate_count > 1
ORDER BY country, disease_name, year_event
LIMIT 10;
0 records
country year_event disease_name disease_category prevalence incidence mortality_pct age_group gender population_affected healthcare_access dr_per_1000 hospital_beds_per_1000 treatment_type average_treatment_cost_usd availability_of_vaccines_treatment recovery_rate_pct dalys improvement_in_5_years_pct per_capita_income_usd education_index urbanization_rate_pct duplicate_count
-- Keep only the latest record per country/disease
WITH ranked AS (
  SELECT 
      country,
      disease_name,
      year_event,
      population_affected,
      ROW_NUMBER() OVER (
          PARTITION BY country, disease_name 
          ORDER BY year_event DESC
      ) AS rn
  FROM public.global_health_statistics
)
SELECT *
FROM ranked
WHERE rn = 1
ORDER BY country, disease_name
LIMIT 20;
Displaying records 1 - 10
country disease_name year_event population_affected rn
Argentina Alzheimer’s Disease 2024 994386 1
Argentina Asthma 2024 997195 1
Argentina Cancer 2024 993535 1
Argentina Cholera 2024 999242 1
Argentina COVID-19 2024 988771 1
Argentina Dengue 2024 988472 1
Argentina Diabetes 2024 987174 1
Argentina Ebola 2024 995657 1
Argentina Hepatitis 2024 990629 1
Argentina HIV/AIDS 2024 989442 1

4. Upserts Overview

In real-world pipelines, we often receive incremental updates (new years, new countries, or corrected values). An upsert (update + insert) ensures the table stays in sync:

If a record exists → update it

If not → insert it

The ON CONFLICT clause prevents duplicates: with DO NOTHING the conflicting insert is skipped, and with DO UPDATE the existing row is refreshed instead.

Tip: ON CONFLICT requires a unique index or constraint on the identifying columns, which is why we create one below.

Keep only one record per duplicate group (here, the row with the largest population_affected), so the unique constraint can be created

DELETE FROM public.global_health_statistics a
USING (
    SELECT ctid, ROW_NUMBER() OVER (
               PARTITION BY country, disease_name, year_event
               ORDER BY population_affected DESC
           ) AS rn
    FROM public.global_health_statistics
) b
WHERE a.ctid = b.ctid AND b.rn > 1;
-- Creating the unique identifier
ALTER TABLE public.global_health_statistics
DROP CONSTRAINT IF EXISTS ghs_unique;
ALTER TABLE public.global_health_statistics
ADD CONSTRAINT ghs_unique UNIQUE (country, year_event, disease_name);
-- Insert or do nothing if the combination already exists
INSERT INTO public.global_health_statistics (
    country, disease_name, year_event
)
VALUES (
    'Australia', 'Influenza', 2022
)
ON CONFLICT (country, disease_name, year_event)
DO NOTHING;
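
The "update it" half of the upsert uses ON CONFLICT … DO UPDATE SET with the special excluded record, which holds the values of the rejected insert. The PostgreSQL syntax is the same as in this runnable Python + SQLite sketch (hypothetical values):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ghs (
  country TEXT, disease_name TEXT, year_event INT, population_affected INT,
  UNIQUE (country, disease_name, year_event))""")
con.execute("INSERT INTO ghs VALUES ('Australia','Influenza',2022, 990000)")

# Upsert: the conflicting row is updated in place instead of being duplicated.
con.execute("""
INSERT INTO ghs (country, disease_name, year_event, population_affected)
VALUES ('Australia','Influenza',2022, 995000)
ON CONFLICT (country, disease_name, year_event)
DO UPDATE SET population_affected = excluded.population_affected;
""")
rows = con.execute("SELECT COUNT(*), MAX(population_affected) FROM ghs").fetchall()
print(rows)  # [(1, 995000)] -- still one row, value refreshed
```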

Part VI — Data Quality, Types & Dates

🧭 Overview

High-quality analysis starts with high-quality data.
This section focuses on assessing and improving data integrity, handling missing values, managing data types, and performing date/time transformations.

In practice, data rarely arrives in perfect shape — missing records, inconsistent data types, and messy strings are common.
Being able to diagnose and correct these issues directly in SQL demonstrates the ability to build analysis-ready datasets that decision-makers can trust.

Missing data occurs for many reasons: incomplete records, data-entry errors, or missing identifiers.
Understanding, profiling, and deciding how to handle these missing values are essential steps before any statistical or predictive modeling.

The workflow includes:

  • Profiling missing data
  • Reviewing incomplete records
  • Applying imputation (filling) strategies
  • Forward and backward filling within patients
  • Mean-based imputations

All SQL examples below are written for PostgreSQL but can be adapted easily for SQL Server or SQLite.


1️⃣ Importing and Preparing the Data

To begin, the dataset is imported into PostgreSQL as public.visit_hospital.
This process ensures that the structure matches the source CSV file (C:\temp\visit_hospital.csv).


-- Create table for hospital visit data
DROP TABLE IF EXISTS public.visit_hospital;
CREATE TABLE public.visit_hospital (
  -- Identifiers
  id TEXT, id_text TEXT,
  -- Demographics
  birthdate DATE, deathdate DATE, ssn TEXT, drivers TEXT, passport TEXT,
  prefix TEXT, first TEXT, middle TEXT, last TEXT, suffix TEXT,
  maiden TEXT, marital TEXT, race TEXT, ethnicity TEXT, gender TEXT,  
  -- Location
  birthplace TEXT, address TEXT, city TEXT, state TEXT, county TEXT,
  fips INT, zip INT, lat NUMERIC, lon NUMERIC,  
  -- Healthcare / Socioeconomic
  healthcare_expenses NUMERIC, healthcare_coverage NUMERIC, income NUMERIC,  
  -- Visit metadata
  visit INT
);
-- Stage the raw text date in a staging column (populated by the COPY below)
ALTER TABLE public.visit_hospital ADD COLUMN IF NOT EXISTS birthdate_raw TEXT;
-- COPY with:
--  - NULL 'NA'      -> turn "NA" into NULL
--  - FORCE_NULL (...) -> also treat empty strings as NULL for listed columns
COPY public.visit_hospital (
  id, birthdate_raw, deathdate, ssn, drivers, passport,
  prefix, first, middle, last, suffix, maiden,
  marital, race, ethnicity, gender,
  birthplace, address, city, state, county,
  fips, zip, lat, lon,
  healthcare_expenses, healthcare_coverage, income,
  visit, id_text
)
FROM 'C:/temp/visit_hospital.csv'
WITH (
  FORMAT CSV,
  HEADER TRUE,
  NULL 'NA',
  FORCE_NULL (birthdate_raw, deathdate, fips, zip, lat, lon,
              healthcare_expenses, healthcare_coverage, income),
  DELIMITER ',',
  QUOTE '"',
  ESCAPE '"',
  ENCODING 'UTF8');
-- Parse the staged text dates into the proper DATE column
UPDATE public.visit_hospital
SET birthdate = TO_DATE(birthdate_raw, 'DD/MM/YYYY')
WHERE birthdate_raw ~ '^\d{2}/\d{2}/\d{4}$';

2️⃣ Profiling Missing Data

Understanding which columns have missing values — and how many — is the foundation of any data-quality review.

Before fixing missing data, we must understand how much is missing and where. This first step mimics colSums(is.na()) in R — it counts missing values for each important column.

-- Summarize missing values per selected columns
SELECT
  'birthdate' AS column_name, COUNT(*) FILTER (WHERE birthdate IS NULL) AS missing_count FROM public.visit_hospital
UNION ALL SELECT 'deathdate', COUNT(*) FILTER (WHERE deathdate IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'race', COUNT(*) FILTER (WHERE race IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'income', COUNT(*) FILTER (WHERE income IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'passport', COUNT(*) FILTER (WHERE passport IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'fips', COUNT(*) FILTER (WHERE fips IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'zip', COUNT(*) FILTER (WHERE zip IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'healthcare_coverage', COUNT(*) FILTER (WHERE healthcare_coverage IS NULL) FROM public.visit_hospital;
8 records
column_name missing_count
birthdate 16
deathdate 40
race 20
income 20
passport 12
fips 12
zip 12
healthcare_coverage 4

Context: Each row in this summary represents a variable and the number of NULL (missing) entries. This gives you an at-a-glance understanding of where data quality issues exist, and whether they are systematic (entire columns) or random (sporadic records).

To get the total number of missing cells across the dataset:

-- Calculate total number of missing cells across selected columns
SELECT
  SUM(
    (birthdate IS NULL)::INT +
    (deathdate IS NULL)::INT +
    (race IS NULL)::INT +
    (income IS NULL)::INT +
    (passport IS NULL)::INT +
    (fips IS NULL)::INT +
    (zip IS NULL)::INT +
    (healthcare_coverage IS NULL)::INT
  ) AS total_missing_cells
FROM public.visit_hospital;
1 records
total_missing_cells
136

This produces one number — the total missing count — useful for data-quality reporting or automated alerts in data pipelines.

3️⃣ Inspecting Rows with Missing Data

After identifying which columns contain missing values, it’s helpful to inspect a few affected records.

-- Display sample of records that contain missing values
SELECT id_text, visit, race, income, passport
FROM public.visit_hospital
WHERE race IS NULL OR income IS NULL OR passport IS NULL
ORDER BY id_text, visit
LIMIT 12;
Displaying records 1 - 10
id_text visit race income passport
Anit|Sánc|1960-05-07 2 NA NA X9157439X
Anit|Sánc|1960-05-07 3 white NA X9157439X
Anit|Sánc|1960-05-07 4 NA 60016 X9157439X
Aure|Weis|2003-11-12 2 NA NA X17586296X
Aure|Weis|2003-11-12 3 white NA X17586296X
Aure|Weis|2003-11-12 4 NA 7752 X17586296X
Cars|Litt|2012-09-02 1 white 71169 NA
Cars|Litt|2012-09-02 2 NA NA NA
Cars|Litt|2012-09-02 3 NA 70169 NA
Cars|Litt|2012-09-02 4 white NA NA

This query mimics filter(!complete.cases(.)) in R. It reveals the context of missing values — are they clustered by person, visit, or variable? This step helps determine the appropriate imputation method (e.g., individual-level fill vs. global averages).

4️⃣ Handling Missing Values (Imputation)

Imputation means replacing missing data with estimated or default values. This section demonstrates several techniques — from simple replacements to forward/backward fills.

4.1 Simple categorical imputation: Replacement with COALESCE()

-- Replace missing passport values with 'UNKNOWN'
SELECT
  id_text,
  visit,
  passport,
  COALESCE(passport, 'UNKNOWN') AS passport_imputed
FROM public.visit_hospital
WHERE passport IS NULL
LIMIT 12;
Displaying records 1 - 10
id_text visit passport passport_imputed
Cars|Litt|2012-09-02 1 NA UNKNOWN
Doll|Fran|2008-10-04 1 NA UNKNOWN
Cars|Litt|2012-09-02 2 NA UNKNOWN
Doll|Fran|2008-10-04 2 NA UNKNOWN
Cars|Litt|2012-09-02 3 NA UNKNOWN
Doll|Fran|2008-10-04 3 NA UNKNOWN
Cars|Litt|2012-09-02 4 NA UNKNOWN
Doll|Fran|2008-10-04 4 NA UNKNOWN
Jimm|Doyl|2005-04-25 1 NA UNKNOWN
Jimm|Doyl|2005-04-25 2 NA UNKNOWN

COALESCE() replaces NULL with a specified value. For example, unknown passport numbers might represent unregistered patients — instead of leaving them blank, we explicitly mark them as “UNKNOWN”. This improves downstream clarity and avoids confusion between “missing” and “not applicable.”

4.2 Numeric imputations

Global mean: stable when per-ID history is sparse. Participant mean: more tailored when individuals have multiple visits.

4.2.1 Global Mean Imputation for Numeric Data

-- Compute overall average income (ignoring missing values)
SELECT AVG(income) AS global_mean_income
FROM public.visit_hospital
WHERE income IS NOT NULL;
1 records
global_mean_income
96123.1
-- Replace missing income values with global mean
SELECT
  id_text,
  visit,
  income,
  COALESCE(income, (SELECT AVG(income) FROM public.visit_hospital WHERE income IS NOT NULL)) AS income_imputed_global
FROM public.visit_hospital
ORDER BY id_text, visit
LIMIT 12;
Displaying records 1 - 10
id_text visit income income_imputed_global
Anit|Sánc|1960-05-07 1 61016 61016.0
Anit|Sánc|1960-05-07 2 NA 96123.1
Anit|Sánc|1960-05-07 3 NA 96123.1
Anit|Sánc|1960-05-07 4 60016 60016.0
Aure|Weis|2003-11-12 1 8752 8752.0
Aure|Weis|2003-11-12 2 NA 96123.1
Aure|Weis|2003-11-12 3 NA 96123.1
Aure|Weis|2003-11-12 4 7752 7752.0
Cars|Litt|2012-09-02 1 71169 71169.0
Cars|Litt|2012-09-02 2 NA 96123.1

This replaces missing income values with the overall mean. While not ideal for predictive models (since it reduces variance), it provides a consistent baseline for descriptive summaries and visualization readiness.
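
The variance caveat is easy to verify directly. A short Python check with hypothetical income values:

```python
from statistics import mean, pvariance

# Hypothetical incomes with missing entries (None = missing).
incomes = [61016, None, None, 60016, 8752, None, 7752]
observed = [x for x in incomes if x is not None]
filled = [x if x is not None else mean(observed) for x in incomes]

# Mean-imputation leaves the mean unchanged but shrinks the variance,
# because every filled cell sits exactly at the centre of the distribution.
print(mean(filled) == mean(observed))           # True
print(pvariance(filled) < pvariance(observed))  # True
```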

4.2.2 Group-Level (Participant) Mean Imputation

A more refined approach is to use each patient’s own average income (if available). This reduces bias when patients vary substantially in income levels.

To better respect within-person variation, we calculate mean income for each participant (id_text) and use that as a fill value.

-- Compute mean income per patient
WITH participant_mean AS (
  SELECT id_text, AVG(income) AS mean_income
  FROM public.visit_hospital
  WHERE income IS NOT NULL
  GROUP BY id_text
)
SELECT
  v.id_text,
  v.visit,
  v.income,
  COALESCE(v.income, p.mean_income) AS income_imputed_participant
FROM public.visit_hospital v
LEFT JOIN participant_mean p
  ON v.id_text = p.id_text
ORDER BY v.id_text, v.visit
LIMIT 12;
Displaying records 1 - 10
id_text visit income income_imputed_participant
Anit|Sánc|1960-05-07 1 61016 61016
Anit|Sánc|1960-05-07 2 NA 60516
Anit|Sánc|1960-05-07 3 NA 60516
Anit|Sánc|1960-05-07 4 60016 60016
Aure|Weis|2003-11-12 1 8752 8752
Aure|Weis|2003-11-12 2 NA 8252
Aure|Weis|2003-11-12 3 NA 8252
Aure|Weis|2003-11-12 4 7752 7752
Cars|Litt|2012-09-02 1 71169 71169
Cars|Litt|2012-09-02 2 NA 70669

Context: This technique aligns with R’s group_by(id_text) %>% summarise(mean(income, na.rm = TRUE)). It produces personalized imputations and avoids using global means that may misrepresent individual economic patterns.

4.3 Forward and Backward Fill (Sequential Imputation)

When data is longitudinal (multiple visits per patient), we may want to carry values forward or backward. SQL has no native fill() like R’s tidyr, but window functions can achieve the same effect: a running count of non-NULL values assigns each NULL row to the group of the last known value, and MAX() over that group fills the gap.

Forward and backward fills propagate the most recent known value across sequential observations — useful when attributes like race or gender should remain constant over multiple visits.

Forward filling uses the last known race value, while backward filling retrieves the next known one. These techniques are valuable when longitudinal patient data should remain internally consistent.

Why both? In visit data, the first or last known race may be the only non-missing value for a participant. Having both forward and backward fills lets us choose the nearest known value relative to each visit.

Forward Fill (by id_text, ordered by visit)

-- Forward fill race within each id_text
WITH base AS (
  SELECT
    id_text,
    visit,
    race,
    -- running group: increases when we hit a non-NULL race
    SUM(CASE WHEN race IS NOT NULL THEN 1 ELSE 0 END)
      OVER (PARTITION BY id_text ORDER BY visit
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
  FROM public.visit_hospital
)
SELECT
  id_text,
  visit,
  race,
  -- carry last known non-NULL forward; stays NULL until first non-NULL appears
  MAX(race) OVER (PARTITION BY id_text, grp) AS race_forward
FROM base
ORDER BY id_text, visit
LIMIT 12;
Displaying records 1 - 10
id_text visit race race_forward
Anit|Sánc|1960-05-07 1 white white
Anit|Sánc|1960-05-07 2 NA white
Anit|Sánc|1960-05-07 3 white white
Anit|Sánc|1960-05-07 4 NA white
Aure|Weis|2003-11-12 1 white white
Aure|Weis|2003-11-12 2 NA white
Aure|Weis|2003-11-12 3 white white
Aure|Weis|2003-11-12 4 NA white
Cars|Litt|2012-09-02 1 white white
Cars|Litt|2012-09-02 2 NA white

Backward Fill

-- Backward fill missing race values by patient

WITH base AS (
  SELECT
    id_text,
    visit,
    race,
    -- running group from the right: increases when we hit a non-NULL race scanning backwards
    SUM(CASE WHEN race IS NOT NULL THEN 1 ELSE 0 END)
      OVER (PARTITION BY id_text ORDER BY visit DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp_rev
  FROM public.visit_hospital
)
SELECT
  id_text,
  visit,
  race,
  MAX(race) OVER (PARTITION BY id_text, grp_rev) AS race_backward
FROM base
ORDER BY id_text, visit
LIMIT 12;
Displaying records 1 - 10
id_text visit race race_backward
Anit|Sánc|1960-05-07 1 white white
Anit|Sánc|1960-05-07 2 NA white
Anit|Sánc|1960-05-07 3 white white
Anit|Sánc|1960-05-07 4 NA NA
Aure|Weis|2003-11-12 1 white white
Aure|Weis|2003-11-12 2 NA white
Aure|Weis|2003-11-12 3 white white
Aure|Weis|2003-11-12 4 NA NA
Cars|Litt|2012-09-02 1 white white
Cars|Litt|2012-09-02 2 NA white

4.4 Combined imputation review (side-by-side)

The example below shows a practical “imputation report”: original vs. forward/backward race and both income fills, restricted to rows that actually had gaps. This gives reviewers a one-glance sanity check.


WITH base AS (
  SELECT
    id_text, visit, race, income
  FROM public.visit_hospital
),
-- forward-fill race within id_text
ff_prep AS (
  SELECT *,
         SUM((race IS NOT NULL)::int)
           OVER (PARTITION BY id_text ORDER BY visit) AS grp
  FROM base
),
fwd AS (
  SELECT id_text, visit,
         MAX(race) OVER (PARTITION BY id_text, grp) AS race_forward
  FROM ff_prep
),
-- backward-fill race within id_text (for leading NULLs)
bf_prep AS (
  SELECT *,
         SUM((race IS NOT NULL)::int)
           OVER (PARTITION BY id_text ORDER BY visit DESC) AS grp_rev
  FROM base
),
bwd AS (
  SELECT id_text, visit,
         MAX(race) OVER (PARTITION BY id_text, grp_rev) AS race_backward
  FROM bf_prep
),
-- means for income imputation
gmean AS (
  SELECT AVG(income) AS global_mean_income
  FROM public.visit_hospital
  WHERE income IS NOT NULL
),
pmean AS (
  SELECT id_text, AVG(income) AS mean_income_id
  FROM public.visit_hospital
  WHERE income IS NOT NULL
  GROUP BY id_text
)
SELECT
  v.id_text, v.visit,
  v.race,
  f.race_forward,
  b.race_backward,
  v.income,
  COALESCE(v.income, g.global_mean_income) AS income_global_mean,
  COALESCE(v.income, p.mean_income_id)     AS income_participant_mean
FROM public.visit_hospital v
LEFT JOIN fwd  f USING (id_text, visit)
LEFT JOIN bwd  b USING (id_text, visit)
CROSS JOIN gmean g
LEFT JOIN pmean p USING (id_text)
WHERE v.race IS NULL OR v.income IS NULL
ORDER BY v.id_text, v.visit NULLS FIRST
LIMIT 15;
Displaying records 1 - 10
id_text visit race race_forward race_backward income income_global_mean income_participant_mean
Anit|Sánc|1960-05-07 2 NA white white NA 96123.1 60516
Anit|Sánc|1960-05-07 3 white white white NA 96123.1 60516
Anit|Sánc|1960-05-07 4 NA white NA 60016 60016.0 60016
Aure|Weis|2003-11-12 2 NA white white NA 96123.1 8252
Aure|Weis|2003-11-12 3 white white white NA 96123.1 8252
Aure|Weis|2003-11-12 4 NA white NA 7752 7752.0 7752
Cars|Litt|2012-09-02 2 NA white white NA 96123.1 70669
Cars|Litt|2012-09-02 3 NA white white 70169 70169.0 70169
Cars|Litt|2012-09-02 4 white white white NA 96123.1 70669
Doll|Fran|2008-10-04 2 NA white white NA 96123.1 187523

🧮 Summary Table of Missing Data Techniques

| Task / Objective | SQL Feature or Function | Purpose / What It Does | When to Use It |
|---|---|---|---|
| Profile Missing Data | COUNT(*) FILTER (WHERE col IS NULL) | Counts how many missing (NULL) entries exist in each column | Always the first step in a data-quality audit |
| Check Total Missing Cells | Summing (col IS NULL)::INT across columns | Produces the total missing count across all variables | Useful for an overall data-completeness score |
| Inspect Incomplete Records | WHERE col IS NULL OR ... | Shows which specific rows have missing values | Use to detect patterns in missingness (e.g., per patient or visit) |
| Replace Missing Text Values | COALESCE(column, 'UNKNOWN') | Fills missing categorical data with an explicit label | Good for identifiers or attributes like passport, gender, or ethnicity |
| Replace Missing Numeric Values (Global Mean) | COALESCE(income, (SELECT AVG(income) FROM ...)) | Fills missing numeric fields with a global average | Simple and consistent for reporting, though it reduces variance |
| Replace Missing Numeric Values (Group Mean) | LEFT JOIN on grouped AVG() | Fills missing values with participant-specific means | More realistic for longitudinal or panel data (e.g., income per patient) |
| Forward Fill (Downward) | Running SUM() of non-NULL markers + MAX() per group (as shown above) | Propagates the last known non-NULL value forward; PostgreSQL window functions lack IGNORE NULLS, so LAST_VALUE() alone cannot do this | Use when variables remain stable over time (e.g., race, gender) |
| Backward Fill (Upward) | The same grouping trick with ORDER BY ... DESC | Fills missing values upward, from the next known observation | Complements forward fill for complete coverage |
| Type Casting for Numeric Fields | CAST(col AS NUMERIC) | Converts text-encoded numbers to a true numeric type | Needed before aggregation, averages, or mathematical operations |
| String Cleaning (Regex) | REGEXP_REPLACE(text, '[^A-Za-z\s]', '', 'g') | Removes non-letter characters or formatting noise | Prepares identifiers or names for joins and standardization |
| Date Conversion | TO_DATE(col, 'YYYY-MM-DD') | Converts character dates to DATE values | Enables date arithmetic, filtering, and time-based grouping |
| Date Component Extraction | EXTRACT(YEAR FROM date_col) | Retrieves year, month, or day components | Supports monthly/annual trend analysis |
| Filtering Complete Cases | WHERE col1 IS NOT NULL AND col2 IS NOT NULL | Keeps only fully complete rows for analysis | For strict models or visualizations that cannot handle NULLs |

Analytical Insights

  • Profiling comes before cleaning. Always understand the pattern and reason behind missingness before applying fixes.
  • Context drives imputation choice. Not every missing value should be replaced — sometimes absence itself carries information (e.g., missing death date = still alive).
  • Group-based imputations preserve individual structure. Replacing income within each participant group respects the natural data hierarchy.
  • Sequential fills replicate temporal logic. Forward and backward filling are essential for visit-based or longitudinal data where attributes should persist across time.
  • Documentation ensures transparency. Each step in the SQL workflow can be reviewed, audited, or reproduced by other analysts.

Data Casting & Cleaning (Regex, Numeric/Text Conversion)

Real-world datasets often contain inconsistent or improperly formatted data — text in numeric fields, extra spaces, mixed casing, or special characters in names. These issues arise during data collection, system exports, or file conversions, and they can silently break joins, aggregations, or numeric calculations.

This section focuses on casting, cleaning, and standardizing key variables in the visit_hospital dataset.
We’ll work through examples that show how SQL handles type conversion, regex-based cleanup, and normalization of text fields.


1️⃣ Understanding Data Type Inconsistencies

Before fixing, it’s important to detect type or formatting inconsistencies.
For example, numeric columns may contain text artifacts ('N/A', 'unknown', '#NULL!'), and text columns may contain hidden characters or mixed case.

-- Preview some raw data for inspection
SELECT id_text, income, fips, zip, passport
FROM public.visit_hospital
LIMIT 12;
Displaying records 1 - 10
id_text income fips zip passport
Hatt|Nade|1986-07-04 86384 25017 2468 X61479569X
Cars|Litt|2012-09-02 71169 44007 2861 NA
Anit|Sánc|1960-05-07 61016 25009 1915 X9157439X
Doll|Fran|2008-10-04 188023 NA NA NA
Hatt|Nade|1986-07-04 NA 25017 2468 X61479569X
Cars|Litt|2012-09-02 NA 44007 2861 NA
Anit|Sánc|1960-05-07 NA 25009 1915 X9157439X
Doll|Fran|2008-10-04 NA NA NA NA
Hatt|Nade|1986-07-04 85384 25017 2468 X61479569X
Cars|Litt|2012-09-02 70169 44007 2861 NA

This step gives a sense of what we’re dealing with. For example, check whether income, a numeric variable, contains null strings, or whether fips and zip contain unexpected symbols or missing digits. Knowing what’s wrong guides the cleaning strategy.

-- Add raw text copies of columns to hold messy values for the cleaning demos
ALTER TABLE public.visit_hospital
  ADD COLUMN IF NOT EXISTS income_raw   TEXT,
  ADD COLUMN IF NOT EXISTS zip_raw      TEXT,
  ADD COLUMN IF NOT EXISTS fips_raw     TEXT,
  ADD COLUMN IF NOT EXISTS marital_raw  TEXT,
  ADD COLUMN IF NOT EXISTS city_raw     TEXT,
  ADD COLUMN IF NOT EXISTS passport_raw TEXT;

UPDATE public.visit_hospital
SET income_raw   = income::text,
    zip_raw      = zip::text,
    fips_raw     = fips::text,
    marital_raw  = marital,
    city_raw     = city,
    passport_raw = passport;

-- Generate a numeric row number to use with % and inject messy test values
WITH numbered AS (
  SELECT id, row_number() OVER (ORDER BY id) AS rn
  FROM public.visit_hospital
)
UPDATE public.visit_hospital AS v
SET income_raw = CASE
      WHEN n.rn % 10 = 0 THEN 'N/A'
      WHEN n.rn % 15 = 0 THEN '#NULL!'
      ELSE v.income_raw END,
    zip_raw = CASE
      WHEN n.rn % 12 = 0 THEN '123'
      WHEN n.rn % 13 = 0 THEN '12-345'
      WHEN n.rn % 14 = 0 THEN 'A12B3'
      ELSE v.zip_raw END,
    fips_raw = CASE
      WHEN n.rn % 11 = 0 THEN 'abc12'
      WHEN n.rn % 17 = 0 THEN '1234'
      ELSE v.fips_raw END,
    marital_raw = CASE
      WHEN n.rn % 8 = 0 THEN 'Single'
      WHEN n.rn % 9 = 0 THEN 'single'
      WHEN n.rn % 10 = 0 THEN 'SINGLE'
      ELSE v.marital_raw END,
    city_raw = CASE
      WHEN n.rn % 20 = 0 THEN ' new york '
      WHEN n.rn % 21 = 0 THEN 'CH!CAGO'
      WHEN n.rn % 22 = 0 THEN 'los_angeles'
      ELSE v.city_raw END,
    passport_raw = CASE
      WHEN n.rn % 25 = 0 THEN ''
      WHEN n.rn % 26 = 0 THEN NULL
      ELSE v.passport_raw END
FROM numbered AS n
WHERE v.id = n.id;

2️⃣ Casting Numeric-Like Columns If a column expected to be numeric contains non-numeric text, direct aggregations like SUM() or AVG() will fail. We can safely convert these columns by filtering or replacing invalid entries.

2.1 Checking Invalid Numeric Values

-- Identify non-numeric entries in columns expected to be numeric
SELECT id_text, income
FROM public.visit_hospital
WHERE income::text !~ '^[0-9]+(\.[0-9]+)?$'
  AND income IS NOT NULL
LIMIT 10;
0 records
id_text income

The regex '^[0-9]+(\.[0-9]+)?$' matches only valid integers or decimals. Anything else (e.g., 'N/A', '#NULL!', 'unknown') is non-numeric. The !~ operator in PostgreSQL keeps rows whose values do not match the regex.
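The same pattern applied to the sentinel-laden income_raw column (populated earlier) surfaces the injected placeholders; a minimal sketch:

```sql
-- Find sentinel strings in the raw income column
SELECT id_text, income_raw
FROM public.visit_hospital
WHERE income_raw IS NOT NULL
  AND income_raw !~ '^\s*-?\d+(\.\d+)?\s*$'
LIMIT 10;
```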

2.2 Convert to Numeric Safely

SELECT
  id_text,
  income_raw,
  CASE
    WHEN income_raw ~ '^\s*-?\d+(\.\d+)?\s*$'
    THEN CAST(TRIM(income_raw) AS NUMERIC)
    ELSE NULL
  END AS income_cleaned
FROM public.visit_hospital
LIMIT 12;
Displaying records 1 - 10
id_text income_raw income_cleaned
Hatt|Nade|1986-07-04 86384 86384
Cars|Litt|2012-09-02 71169 71169
Anit|Sánc|1960-05-07 61016 61016
Doll|Fran|2008-10-04 188023 188023
Hatt|Nade|1986-07-04 NA NA
Cars|Litt|2012-09-02 NA NA
Anit|Sánc|1960-05-07 NA NA
Doll|Fran|2008-10-04 NA NA
Hatt|Nade|1986-07-04 85384 85384
Cars|Litt|2012-09-02 70169 70169

This avoids errors by casting only valid numeric entries. Invalid strings are replaced with NULL, allowing numeric aggregations to run safely.

3️⃣ Text Cleaning with Regex and Standardization

3.1 Clean City and Name Fields

ALTER TABLE public.visit_hospital
  ADD COLUMN IF NOT EXISTS city_raw TEXT;
UPDATE public.visit_hospital
SET city_raw = city;
-- Remove special characters and extra spaces from text
SELECT
  id_text,
  city_raw,
  REGEXP_REPLACE(TRIM(city_raw), '[^A-Za-z\\s]', '', 'g') AS city_cleaned
FROM public.visit_hospital
LIMIT 12;
Displaying records 1 - 10
id_text city_raw city_cleaned
Hatt|Nade|1986-07-04 Newton Newton
Cars|Litt|2012-09-02 Attleboro Attleboro
Anit|Sánc|1960-05-07 Beverly Beverly
Doll|Fran|2008-10-04 Shrewsbury Shrewsbury
Hatt|Nade|1986-07-04 Newton Newton
Cars|Litt|2012-09-02 Attleboro Attleboro
Anit|Sánc|1960-05-07 Beverly Beverly
Doll|Fran|2008-10-04 Shrewsbury Shrewsbury
Hatt|Nade|1986-07-04 Newton Newton
Cars|Litt|2012-09-02 Attleboro Attleboro

TRIM() removes leading and trailing spaces.

REGEXP_REPLACE() strips non-alphabetic characters (e.g., “CH!CAGO” → “CHCAGO”).

These techniques clean text for better grouping and joining.

3.2 Standardize Text Casing

-- Normalize categorical text casing
SELECT
  id_text,
  marital,
  LOWER(marital) AS marital_lower,
  INITCAP(LOWER(marital)) AS marital_title
FROM public.visit_hospital
LIMIT 12;
Displaying records 1 - 10
id_text marital marital_lower marital_title
Hatt|Nade|1986-07-04 M m M
Cars|Litt|2012-09-02 NA NA NA
Anit|Sánc|1960-05-07 W w W
Doll|Fran|2008-10-04 NA NA NA
Hatt|Nade|1986-07-04 M m M
Cars|Litt|2012-09-02 NA NA NA
Anit|Sánc|1960-05-07 W w W
Doll|Fran|2008-10-04 NA NA NA
Hatt|Nade|1986-07-04 M m M
Cars|Litt|2012-09-02 NA NA NA

Mixed casing can create false duplicates (“Single”, “single”, “SINGLE”). LOWER() ensures uniformity for analysis, while INITCAP() is ideal for presentation-ready formats.

4️⃣ Cleaning and Padding Identifiers

4.1 ZIP Code Cleaning

ALTER TABLE public.visit_hospital
  ADD COLUMN IF NOT EXISTS zip_raw TEXT;
UPDATE public.visit_hospital
SET zip_raw = zip::text
WHERE zip_raw IS NULL;
-- Remove non-numeric characters and pad to 5 digits
SELECT
  id_text,
  zip_raw,
  LPAD(REGEXP_REPLACE(zip_raw, '[^0-9]', '', 'g'), 5, '0') AS zip_cleaned
FROM public.visit_hospital
LIMIT 12;
Displaying records 1 - 10
id_text zip_raw zip_cleaned
Hatt|Nade|1986-07-04 2468 02468
Cars|Litt|2012-09-02 2861 02861
Anit|Sánc|1960-05-07 1915 01915
Doll|Fran|2008-10-04 NA NA
Hatt|Nade|1986-07-04 2468 02468
Cars|Litt|2012-09-02 2861 02861
Cars|Litt|2012-09-02 2861 02861
Anit|Sánc|1960-05-07 1915 01915
Doll|Fran|2008-10-04 NA NA
Hatt|Nade|1986-07-04 2468 02468

REGEXP_REPLACE() removes dashes, letters, and spaces; LPAD() ensures all ZIP codes have five digits (‘123’ → ‘00123’).

4.2 FIPS Code Cleaning

ALTER TABLE public.visit_hospital
  ADD COLUMN IF NOT EXISTS fips_raw TEXT;
UPDATE public.visit_hospital
SET fips_raw = fips::text
WHERE fips_raw IS NULL;
-- Clean and normalize FIPS codes
SELECT
  id_text,
  fips_raw,
  LPAD(REGEXP_REPLACE(fips_raw, '[^0-9]', '', 'g'), 5, '0') AS fips_cleaned
FROM public.visit_hospital
LIMIT 12;
Displaying records 1 - 10
id_text fips_raw fips_cleaned
Hatt|Nade|1986-07-04 25017 25017
Cars|Litt|2012-09-02 44007 44007
Anit|Sánc|1960-05-07 25009 25009
Doll|Fran|2008-10-04 NA NA
Hatt|Nade|1986-07-04 25017 25017
Cars|Litt|2012-09-02 44007 44007
Cars|Litt|2012-09-02 44007 44007
Anit|Sánc|1960-05-07 25009 25009
Doll|Fran|2008-10-04 NA NA
Hatt|Nade|1986-07-04 25017 25017

Standardized FIPS codes ensure geographic consistency. Removing non-numeric characters prevents join failures when linking with regional or census datasets.
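As an illustration, the cleaned FIPS key can drive a join against a reference table. Note that county_lookup and county_name below are hypothetical names for a census-style lookup table, not part of the dataset:

```sql
-- Join cleaned FIPS codes to a hypothetical county lookup table
SELECT
  v.id_text,
  LPAD(REGEXP_REPLACE(v.fips_raw, '[^0-9]', '', 'g'), 5, '0') AS fips_cleaned,
  c.county_name                 -- hypothetical column
FROM public.visit_hospital AS v
LEFT JOIN county_lookup AS c    -- hypothetical reference table
  ON c.fips = LPAD(REGEXP_REPLACE(v.fips_raw, '[^0-9]', '', 'g'), 5, '0')
LIMIT 10;
```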

5️⃣ Categorical & Identifier Normalization

ALTER TABLE public.visit_hospital
  ADD COLUMN IF NOT EXISTS marital_raw TEXT;
UPDATE public.visit_hospital
SET marital_raw = marital
WHERE marital_raw IS NULL;
-- Identify distinct marital variants
SELECT DISTINCT marital_raw
FROM public.visit_hospital
ORDER BY marital_raw;
5 records
marital_raw
D
M
S
W
NA

In manually entered survey or registry data, this quickly reveals casing variants such as “single”, “Single”, and “SINGLE”; here the stored values are compact codes (D, M, S, W) plus NA.
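To quantify how much casing inflates category counts, group on the case-normalized value; a minimal sketch:

```sql
-- Count rows per case-normalized marital category
SELECT
  LOWER(TRIM(marital_raw)) AS marital_norm,
  COUNT(*) AS n
FROM public.visit_hospital
GROUP BY LOWER(TRIM(marital_raw))
ORDER BY n DESC;
```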

6️⃣ Before vs After: Data Cleaning Summary View

The next query displays raw vs. cleaned columns side-by-side, allowing a clear visual comparison for documentation or dashboards.

-- Add raw text versions of numeric/text fields for cleaning
ALTER TABLE public.visit_hospital
  ADD COLUMN IF NOT EXISTS income_raw   TEXT,
  ADD COLUMN IF NOT EXISTS zip_raw      TEXT,
  ADD COLUMN IF NOT EXISTS fips_raw     TEXT,
  ADD COLUMN IF NOT EXISTS marital_raw  TEXT,
  ADD COLUMN IF NOT EXISTS city_raw     TEXT,
  ADD COLUMN IF NOT EXISTS passport_raw TEXT;
  
-- Populate them from existing columns
UPDATE public.visit_hospital
SET
  income_raw   = income::text,
  zip_raw      = zip::text,
  fips_raw     = fips::text,
  marital_raw  = marital,
  city_raw     = city,
  passport_raw = passport
WHERE income_raw   IS NULL
   OR zip_raw      IS NULL
   OR fips_raw     IS NULL
   OR marital_raw  IS NULL
   OR city_raw     IS NULL
   OR passport_raw IS NULL;
-- Create a clean comparison view of selected variables
SELECT
  id_text,
  income_raw,
  CASE
    WHEN income_raw ~ '^\s*-?\d+(\.\d+)?\s*$'
    THEN CAST(TRIM(income_raw) AS NUMERIC)
    ELSE NULL
  END AS income_cleaned,
  zip_raw,
  LPAD(REGEXP_REPLACE(zip_raw, '[^0-9]', '', 'g'), 5, '0') AS zip_cleaned,
  fips_raw,
  LPAD(REGEXP_REPLACE(fips_raw, '[^0-9]', '', 'g'), 5, '0') AS fips_cleaned,
  marital_raw,
  INITCAP(LOWER(marital_raw)) AS marital_cleaned,
  city_raw,
  REGEXP_REPLACE(TRIM(city_raw), '[^A-Za-z\\s]', '', 'g') AS city_cleaned,
  passport_raw,
  COALESCE(NULLIF(TRIM(passport_raw), ''), 'UNKNOWN') AS passport_cleaned
FROM public.visit_hospital
ORDER BY id_text
LIMIT 20;
Displaying records 1 - 10
id_text income_raw income_cleaned zip_raw zip_cleaned fips_raw fips_cleaned marital_raw marital_cleaned city_raw city_cleaned passport_raw passport_cleaned
Anit|Sánc|1960-05-07 NA NA 1915 01915 25009 25009 W W Beverly Beverly X9157439X X9157439X
Anit|Sánc|1960-05-07 61016 61016 1915 01915 25009 25009 W W Beverly Beverly X9157439X X9157439X
Anit|Sánc|1960-05-07 60016 60016 1915 01915 25009 25009 W W Beverly Beverly X9157439X X9157439X
Anit|Sánc|1960-05-07 NA NA 1915 01915 25009 25009 W W Beverly Beverly X9157439X X9157439X
Aure|Weis|2003-11-12 NA NA 1606 01606 25027 25027 NA NA Worcester Worcester X17586296X X17586296X
Aure|Weis|2003-11-12 7752 7752 1606 01606 25027 25027 NA NA Worcester Worcester X17586296X X17586296X
Aure|Weis|2003-11-12 8752 8752 1606 01606 25027 25027 NA NA Worcester Worcester X17586296X X17586296X
Aure|Weis|2003-11-12 NA NA 1606 01606 25027 25027 NA NA Worcester Worcester X17586296X X17586296X
Cars|Litt|2012-09-02 71169 71169 2861 02861 44007 44007 NA NA Attleboro Attleboro NA UNKNOWN
Cars|Litt|2012-09-02 NA NA 2861 02861 44007 44007 NA NA Attleboro Attleboro NA UNKNOWN

This “before and after” display is ideal for portfolio demonstrations — clearly showing how SQL transformations standardize messy data.
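To keep the comparison reusable for dashboards or QA reviews, the same logic can be saved as a view. This is a trimmed sketch, and the view name is a hypothetical choice:

```sql
-- Persist the raw-vs-clean comparison as a reusable view (hypothetical name)
CREATE OR REPLACE VIEW public.visit_hospital_clean_compare AS
SELECT
  id_text,
  income_raw,
  CASE
    WHEN income_raw ~ '^\s*-?\d+(\.\d+)?\s*$'
    THEN CAST(TRIM(income_raw) AS NUMERIC)
  END AS income_cleaned,
  zip_raw,
  LPAD(REGEXP_REPLACE(zip_raw, '[^0-9]', '', 'g'), 5, '0') AS zip_cleaned,
  marital_raw,
  INITCAP(LOWER(marital_raw)) AS marital_cleaned
FROM public.visit_hospital;
```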

🧮 Summary of Techniques and Insights

Task / Goal SQL Technique Purpose / What It Fixes Analytical Context / When to Use
Detect invalid numeric entries !~ '^\s*-?\d+(\.\d+)?\s*$' Finds non-numeric strings in numeric columns Ensures numeric operations (e.g., AVG, SUM) don’t fail or misreport
Safe numeric casting CASE WHEN col ~ regex THEN CAST(col AS NUMERIC) Converts only valid entries to numeric Prevents casting errors during data type conversions
Trim and clean text fields TRIM() Removes extra spaces from names or city fields Useful for manual-entry data or multi-system merges
Remove unwanted characters REGEXP_REPLACE(col, '[^A-Za-z\\s]', '', 'g') Cleans text of digits, punctuation, and symbols Standardizes strings for grouping, joining, or machine learning models
Normalize text casing LOWER(), UPPER(), INITCAP() Converts inconsistent casing to standard formats Ensures accurate grouping and consistent visual presentation
Clean ZIP and FIPS codes REGEXP_REPLACE() + LPAD() Removes non-numeric characters and pads missing zeros Prepares identifiers for geographic joins and validation
Handle blank or missing categorical fields COALESCE(NULLIF(TRIM(col), ''), 'UNKNOWN') Converts blank text and NULLs to clear labels Makes categorical variables explicit for dashboards and reporting
Detect inconsistent category variants SELECT DISTINCT LOWER(col) Finds duplicated categories differing only by case Helps consolidate values before aggregation
Preview before vs. after cleaning Raw vs. cleaned comparison query Shows immediate data improvements Ideal for portfolio documentation and QA verification

Key Takeaways

Data type consistency ensures that every column supports its intended operation (numeric → arithmetic, text → joins).

Regex-based cleaning enables flexible and repeatable correction of messy or user-entered data.

Case and category normalization eliminate hidden duplicates, improving grouping accuracy.

Padding identifiers keeps cross-referenced codes (like ZIP or FIPS) aligned with external sources.

Blank-field handling with COALESCE() ensures clarity between “missing” and “not applicable.”

SQL-first cleaning supports reproducibility and scalability — ideal for enterprise pipelines and analytical environments.

Date & Time Basics (conversion, extraction, validation)

Dates and times are among the most critical yet error-prone data types in analytics.
Whether you’re working with patient visits, transactions, or time-series indicators, proper handling of temporal data ensures analytical accuracy and chronological integrity.

This section demonstrates key SQL techniques for date conversion, validation, and temporal analysis using the visit_hospital dataset.

Inspecting and Profiling Date Columns

The dataset includes birthdate, deathdate, and a visit number field (visit).
Let’s quickly inspect their contents and verify their types.

-- Review the date columns to understand their current formats
SELECT
  id_text,
  birthdate,
  deathdate,
  visit
FROM public.visit_hospital
LIMIT 10;
Displaying records 1 - 10
id_text birthdate deathdate visit
Hatt|Nade|1986-07-04 NA NA 1
Cars|Litt|2012-09-02 NA NA 1
Anit|Sánc|1960-05-07 NA NA 1
Doll|Fran|2008-10-04 NA NA 1
Cars|Litt|2012-09-02 NA NA 2
Cars|Litt|2012-09-02 NA NA 3
Anit|Sánc|1960-05-07 NA NA 3
Doll|Fran|2008-10-04 NA NA 3
Hatt|Nade|1986-07-04 NA NA 4
Cars|Litt|2012-09-02 NA NA 4

This initial inspection gives a quick sense of what we’re working with — typically clean and consistent at this stage.

Creating Raw Text Versions of Date Fields

As with the *_raw columns earlier, the examples that follow operate on a raw text copy of each date field (e.g., birthdate_raw), so that malformed values can be profiled without triggering casting errors.
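The raw copies used in the date-cleaning queries below can be created in the same pattern as the earlier *_raw columns; a minimal sketch (the column names are assumed from the later queries):

```sql
-- Add raw text copies of the date fields (assumed column names)
ALTER TABLE public.visit_hospital
  ADD COLUMN IF NOT EXISTS birthdate_raw TEXT,
  ADD COLUMN IF NOT EXISTS deathdate_raw TEXT;

UPDATE public.visit_hospital
SET birthdate_raw = birthdate::text,
    deathdate_raw = deathdate::text
WHERE birthdate_raw IS NULL
   OR deathdate_raw IS NULL;
```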

Detect Invalid or Impossible Dates

A common quality check involves identifying logically impossible dates, such as birthdates in the future or deathdates earlier than birthdates.
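Such logical checks reduce to simple comparisons; a minimal sketch, assuming birthdate and deathdate are DATE-typed (as in the profiling queries later in this section):

```sql
-- Flag logically impossible dates
SELECT id_text, birthdate, deathdate
FROM public.visit_hospital
WHERE birthdate > CURRENT_DATE          -- born in the future
   OR (deathdate IS NOT NULL
       AND deathdate < birthdate)       -- death before birth
LIMIT 10;
```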


Pattern profiling identifies all structural variations. Typical outputs include YYYY-MM-DD, YYYY/MM/DD, DD-MM-YYYY, ISO with time, and residual invalid or sentinel categories. Such analysis helps define your cleaning strategy precisely.

Profiling Valid vs. Invalid Date Patterns

We first split potential YYYY-MM-DD entries into numeric components and classify them as valid, invalid, or unknown/other based on logical rules.

-- Classify raw dates by validity and pattern
WITH parts AS (
  SELECT
    id_text,
    birthdate_raw,
    CASE
      WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 1)::int
    END AS year,
    CASE
      WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 2)::int
    END AS month,
    CASE
      WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 3)::int
    END AS day
  FROM public.visit_hospital
),
classified AS (
  SELECT
    id_text,
    birthdate_raw,
    CASE
      WHEN year IS NOT NULL THEN
        CASE
          WHEN month BETWEEN 1 AND 12
           AND day   BETWEEN 1 AND 31
          THEN 'Valid YYYY-MM-DD'
          ELSE 'Invalid YYYY-MM-DD'
        END
      ELSE 'Other/Unknown Format'
    END AS pattern
  FROM parts
)
SELECT pattern, COUNT(*) AS record_count
FROM classified
GROUP BY pattern
ORDER BY record_count DESC;
3 records
pattern record_count
Other/Unknown Format 32
Valid YYYY-MM-DD 4
Invalid YYYY-MM-DD 4

split_part() extracts year, month, and day components.

The CASE logic validates ranges without immediately casting — avoiding date/time field out of range errors.

This provides a count of valid vs. invalid entries before proceeding.

Previewing Invalid and Unrecognized Formats

To inspect the actual records in those categories, filter directly on the pattern column.

-- Inspect rows with invalid or unknown date formats
WITH parts AS (
  SELECT
    id_text,
    birthdate_raw,
    CASE
      WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 1)::int
    END AS year,
    CASE
      WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 2)::int
    END AS month,
    CASE
      WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 3)::int
    END AS day
  FROM public.visit_hospital
),
classified AS (
  SELECT
    id_text,
    birthdate_raw,
    CASE
      WHEN year IS NOT NULL THEN
        CASE
          WHEN month BETWEEN 1 AND 12
           AND day   BETWEEN 1 AND 31
          THEN 'Valid YYYY-MM-DD'
          ELSE 'Invalid YYYY-MM-DD'
        END
      ELSE 'Other/Unknown Format'
    END AS pattern
  FROM parts
)
SELECT id_text, birthdate_raw, pattern
FROM classified
WHERE pattern IN ('Invalid YYYY-MM-DD', 'Other/Unknown Format')
ORDER BY pattern, id_text
LIMIT 20;
Displaying records 1 - 10
id_text birthdate_raw pattern
Home|Quiñ|1955-10-26 1984-13-24 Invalid YYYY-MM-DD
Home|Quiñ|1955-10-26 1984-13-24 Invalid YYYY-MM-DD
Home|Quiñ|1955-10-26 1984-13-24 Invalid YYYY-MM-DD
Home|Quiñ|1955-10-26 1984-13-24 Invalid YYYY-MM-DD
Anit|Sánc|1960-05-07 NA Other/Unknown Format
Anit|Sánc|1960-05-07 NA Other/Unknown Format
Anit|Sánc|1960-05-07 NA Other/Unknown Format
Anit|Sánc|1960-05-07 NA Other/Unknown Format
Cars|Litt|2012-09-02 1986/07/04 Other/Unknown Format
Cars|Litt|2012-09-02 1986/07/04 Other/Unknown Format

This step lists actual problematic records — the type of visual audit analysts perform before applying corrections. It allows you to verify whether errors are structural (e.g., 1984-13-24) or represent other issues such as placeholders (NA, blanks, or unrecognized symbols).

Repairing and Formatting Invalid Dates

Invalid entries can be normalized through conditional parsing, string manipulation, or replacement rules. The example below demonstrates three repair techniques:

  • Convert slashed or dashed formats consistently.

  • Replace sentinel values (NA, #NULL!) with NULL.

  • Remove invalid or non-parsable text safely.

-- Clean and standardize textual date representations
SELECT
  id_text,
  birthdate_raw,
  CASE
    WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN NULL
    WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$' THEN REPLACE(birthdate_raw, '/', '-')  -- unify separator
    WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$' THEN
         TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD')                -- convert to ISO
    WHEN birthdate_raw ~ '^\s*\d{4}-\d{2}-\d{2}\s*$' THEN TRIM(birthdate_raw)             -- already clean
    ELSE NULL
  END AS birthdate_formatted
FROM public.visit_hospital
LIMIT 20;
Displaying records 1 - 10
id_text birthdate_raw birthdate_formatted
Hatt|Nade|1986-07-04 NA NA
Cars|Litt|2012-09-02 1986/07/04 1986-07-04
Anit|Sánc|1960-05-07 NA NA
Doll|Fran|2008-10-04 NA NA
Cars|Litt|2012-09-02 1986/07/04 1986-07-04
Cars|Litt|2012-09-02 1986/07/04 1986-07-04
Anit|Sánc|1960-05-07 NA NA
Doll|Fran|2008-10-04 NA NA
Dwig|Hill|1984-08-24 NA NA
Tamb|Bali|1953-05-10 NA NA

This converts all compatible formats into the standard YYYY-MM-DD.

Invalid, unrecognized, or sentinel entries are replaced with NULL.

Using TO_CHAR() ensures consistent ISO formatting for presentation or export.

The next query extends this date-cleaning logic to ensure that all conversions are type-safe, format-consistent, and free of range errors.

By validating both the structure and the value ranges of date components before conversion, we can confidently transform diverse text-based inputs into clean and usable DATE fields.

-- Safely convert and validate mixed-format date strings
WITH cleaned AS (
  SELECT
    id_text,
    CASE
      -- Handle missing or placeholder values
      WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN NULL

      -- Reformat YYYY/MM/DD → YYYY-MM-DD
      WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$'
        THEN REPLACE(birthdate_raw, '/', '-')
      -- Convert DD-MM-YYYY → YYYY-MM-DD
      WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$'
        THEN TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD')

      -- Keep YYYY-MM-DD only if month/day are in valid ranges
      WHEN birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$'
        THEN TRIM(birthdate_raw)
      ELSE NULL
    END AS birthdate_clean_text
  FROM public.visit_hospital
)
SELECT
  id_text,
  birthdate_clean_text,
  CASE
    -- Only cast strings that pass full structural and range validation
    WHEN birthdate_clean_text ~ '^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
      THEN TO_DATE(birthdate_clean_text, 'YYYY-MM-DD')
    ELSE NULL
  END AS birthdate_final
FROM cleaned
WHERE birthdate_clean_text IS NOT NULL
LIMIT 15;
Displaying records 1 - 10
id_text birthdate_clean_text birthdate_final
Cars|Litt|2012-09-02 1986-07-04 NA
Cars|Litt|2012-09-02 1986-07-04 NA
Cars|Litt|2012-09-02 1986-07-04 NA
Aure|Weis|2003-11-12 2003-11-12 2003-11-12
Mari|Lind|1966-06-23 1986-04-07 1986-04-07
Aure|Weis|2003-11-12 2003-11-12 2003-11-12
Mari|Lind|1966-06-23 1986-04-07 1986-04-07
Mari|Lind|1966-06-23 1986-04-07 1986-04-07
Aure|Weis|2003-11-12 2003-11-12 2003-11-12
Mari|Lind|1966-06-23 1986-04-07 1986-04-07

The inner CASE logic prepares a standardized YYYY-MM-DD text representation.

Regex guards ensure only syntactically and semantically valid month/day values are converted.

Invalid, blank, or sentinel entries are safely excluded.

The outer CASE performs the final casting into the true PostgreSQL DATE type.

This two-step validation (first reformat, then convert) guarantees that your code runs without any 22008: date/time field value out of range errors, even on messy or mixed-format datasets.
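Once validated, the standardized strings can be persisted into a true DATE column; a sketch assuming a new column named birthdate_clean (restricted to already-ISO strings for brevity — the fuller CASE logic above can be substituted in the WHERE clause):

```sql
-- Persist validated dates into a real DATE column (hypothetical column name)
ALTER TABLE public.visit_hospital
  ADD COLUMN IF NOT EXISTS birthdate_clean DATE;

UPDATE public.visit_hospital
SET birthdate_clean = TO_DATE(TRIM(birthdate_raw), 'YYYY-MM-DD')
WHERE birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$';
```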

Preview of Successfully Cleaned Dates

-- Preview cleaned and validated dates
WITH cleaned AS (
  SELECT
    id_text,
    CASE
      WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN NULL
      WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$' THEN REPLACE(birthdate_raw, '/', '-')
      WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$' THEN TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD')
      WHEN birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$' THEN TRIM(birthdate_raw)
      ELSE NULL
    END AS birthdate_clean_text
  FROM public.visit_hospital
)
SELECT
  id_text,
  birthdate_clean_text,
  TO_DATE(birthdate_clean_text, 'YYYY-MM-DD') AS birthdate_final
FROM cleaned
WHERE birthdate_clean_text IS NOT NULL
ORDER BY id_text
LIMIT 20;
Displaying records 1 - 10
id_text birthdate_clean_text birthdate_final
Aure|Weis|2003-11-12 2003-11-12 2003-11-12
Aure|Weis|2003-11-12 2003-11-12 2003-11-12
Aure|Weis|2003-11-12 2003-11-12 2003-11-12
Aure|Weis|2003-11-12 2003-11-12 2003-11-12
Cars|Litt|2012-09-02 1986-07-04 1986-07-04
Cars|Litt|2012-09-02 1986-07-04 1986-07-04
Cars|Litt|2012-09-02 1986-07-04 1986-07-04
Cars|Litt|2012-09-02 1986-07-04 1986-07-04
Mari|Lind|1966-06-23 1986-04-07 1986-04-07
Mari|Lind|1966-06-23 1986-04-07 1986-04-07

This preview confirms that only valid and correctly formatted entries remain. All date strings are now standardized in ISO format (YYYY-MM-DD) and safely castable into the DATE type for further analysis.

Detecting Residual Missing or Invalid Entries

It’s useful to summarize how many rows were successfully cleaned versus how many remain missing or invalid. This summary provides a data-quality snapshot — showing the proportion of records that were valid, converted, or excluded.

-- Summarize validity of birthdate entries after cleaning
WITH final_status AS (
  SELECT
    CASE
      WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN 'Missing / Placeholder'
      WHEN birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$' THEN 'Valid (YYYY-MM-DD)'
      WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$' THEN 'Converted (DD-MM-YYYY)'
      WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$' THEN 'Converted (YYYY/MM/DD)'
      ELSE 'Other / Invalid'
    END AS category
  FROM public.visit_hospital
)
SELECT category, COUNT(*) AS record_count
FROM final_status
GROUP BY category
ORDER BY record_count DESC;
5 records
category record_count
Missing / Placeholder 20
Other / Invalid 8
Converted (YYYY/MM/DD) 4
Valid (YYYY-MM-DD) 4
Converted (DD-MM-YYYY) 4

List the invalid and unknown rows

WITH normalized AS (
  SELECT
    id_text,
    birthdate_raw,
    NULLIF(TRIM(birthdate_raw), '') AS raw_trim,
    CASE
      WHEN birthdate_raw ~ '^\s*\d{4}[-/]\d{2}[-/]\d{2}\s*$'
        THEN REGEXP_REPLACE(TRIM(birthdate_raw), '/', '-', 'g')
      WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$'
        THEN TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD')
      ELSE NULL
    END AS ymd_text
  FROM public.visit_hospital
),
parts AS (
  SELECT
    id_text, birthdate_raw, raw_trim, ymd_text,
    CASE WHEN ymd_text ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(ymd_text,'-',1)::int END AS y,
    CASE WHEN ymd_text ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(ymd_text,'-',2)::int END AS m,
    CASE WHEN ymd_text ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(ymd_text,'-',3)::int END AS d
  FROM normalized
),
validated AS (
  SELECT
    id_text, birthdate_raw, ymd_text, y, m, d,
    -- Guard: only call make_date() for months 1-12, avoiding out-of-range errors
    CASE
      WHEN m BETWEEN 1 AND 12 THEN
        d BETWEEN 1 AND EXTRACT(DAY FROM (date_trunc('month', make_date(y,m,1))
                                          + INTERVAL '1 month' - INTERVAL '1 day'))
      ELSE FALSE
    END AS is_valid
  FROM parts
)
SELECT
  id_text,
  birthdate_raw,
  ymd_text AS normalized_text,
  y, m, d,
  CASE
    WHEN ymd_text IS NOT NULL AND NOT is_valid THEN 'Invalid (impossible date)'
    WHEN ymd_text IS NULL THEN 'Unknown/Unparsed'
  END AS reason
FROM validated
WHERE (ymd_text IS NOT NULL AND NOT is_valid)
   OR (ymd_text IS NULL
       -- and is not an explicit placeholder
       AND COALESCE(TRIM(birthdate_raw), '') NOT IN ('NA', '#NULL!', ''))
ORDER BY reason, id_text
LIMIT 20;
4 records
id_text birthdate_raw normalized_text y m d reason
Home|Quiñ|1955-10-26 1984-13-24 1984-13-24 1984 13 24 Invalid (impossible date)
Home|Quiñ|1955-10-26 1984-13-24 1984-13-24 1984 13 24 Invalid (impossible date)
Home|Quiñ|1955-10-26 1984-13-24 1984-13-24 1984 13 24 Invalid (impossible date)
Home|Quiñ|1955-10-26 1984-13-24 1984-13-24 1984 13 24 Invalid (impossible date)

Range Profiling of Cleaned Dates

After cleaning and conversion to valid dates, you can check for chronological plausibility (e.g., no future birthdates).

-- Profile chronological range of cleaned birthdates
WITH validated AS (
  SELECT
    TO_DATE(TRIM(birthdate_raw), 'YYYY-MM-DD') AS birthdate_final
  FROM public.visit_hospital
  WHERE birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$'
)
SELECT
  MIN(birthdate_final) AS earliest_birth,
  MAX(birthdate_final) AS latest_birth,
  COUNT(*) AS total_valid
FROM validated;
1 records
earliest_birth latest_birth total_valid
2003-11-12 2003-11-12 4

Summary of Date Formatting

Step Technique Purpose
Pattern profiling ~ (regex match) + split_part() Identify valid, invalid, or mixed-format entries
Safe reformatting REPLACE() + TO_CHAR(TO_DATE(...)) Convert raw input to ISO YYYY-MM-DD text safely
Range validation Regex guard (0[1-9]|1[0-2]) for month, etc. Block out-of-range month/day before conversion
Safe conversion CASE WHEN ... THEN TO_DATE() Cast only confirmed valid formats to DATE
Quality summary GROUP BY category Provide audit view of data quality categories
Range check MIN() / MAX() Confirm chronological plausibility of cleaned dates

Date & Time Analysis

Profiling the Date of Birth Column

Start with an overview to check completeness and plausible ranges.

-- Profile birthdate data: completeness and range
SELECT
  COUNT(*) AS total_records,
  COUNT(birthdate) AS non_missing_birthdate,
  MIN(birthdate) AS earliest_birth,
  MAX(birthdate) AS latest_birth
FROM public.visit_hospital;
1 records
total_records non_missing_birthdate earliest_birth latest_birth
40 24 1953-05-10 2005-04-25

This confirms the time coverage of your data (here, birthdates from 1953 to 2005) and helps detect potential outliers or missing values.

Calculating Births by Month and Year

-- Count births by month
SELECT
  DATE_TRUNC('month', birthdate)::date AS birth_month,
  COUNT(*) AS total_births
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY 1
ORDER BY birth_month;
6 records
birth_month total_births
1953-05-01 4
1955-10-01 4
1966-06-01 4
1984-08-01 4
2003-11-01 4
2005-04-01 4
-- Births by Year (show only the year number)
SELECT
  EXTRACT(YEAR FROM birthdate)::int AS birth_year,
  COUNT(*) AS total_births
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY 1
ORDER BY birth_year;
6 records
birth_year total_births
1953 4
1955 4
1966 4
1984 4
2003 4
2005 4

Calculating Current Age in Years

The AGE() function calculates an interval between two dates, while DATE_PART() extracts the number of years.

-- Compute current age in years for each individual
SELECT
  id_text,
  birthdate,
  DATE_PART('year', AGE(CURRENT_DATE, birthdate)) AS age_years
FROM public.visit_hospital
ORDER BY age_years DESC
LIMIT 15;
Displaying records 1 - 10
id_text birthdate age_years
Hatt|Nade|1986-07-04 NA NA
Anit|Sánc|1960-05-07 NA NA
Cars|Litt|2012-09-02 NA NA
Doll|Fran|2008-10-04 NA NA
Doll|Fran|2008-10-04 NA NA
Hatt|Nade|1986-07-04 NA NA
Doll|Fran|2008-10-04 NA NA
Doll|Fran|2008-10-04 NA NA
Cars|Litt|2012-09-02 NA NA
Hatt|Nade|1986-07-04 NA NA

AGE(CURRENT_DATE, birthdate) returns the elapsed interval since birth, and DATE_PART('year', ...) extracts the total number of full years.
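Note that rows with a NULL birthdate produce a NULL age, and PostgreSQL sorts NULLs first under ORDER BY ... DESC, which is why the NA rows appear at the top of the output above. A small variation (a sketch against the same visit_hospital table) excludes or demotes them:

-- Variant: exclude NULL birthdates, or push NULL ages to the end
SELECT
  id_text,
  birthdate,
  DATE_PART('year', AGE(CURRENT_DATE, birthdate)) AS age_years
FROM public.visit_hospital
WHERE birthdate IS NOT NULL       -- alternatively: ORDER BY age_years DESC NULLS LAST
ORDER BY age_years DESC
LIMIT 15;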

Calculating Duration from Birthdate to a Reference Date

Interval arithmetic is often used to determine time to a fixed milestone (e.g., 18th or 65th birthday).

-- Time until 65th birthday for each person
SELECT
  id_text,
  birthdate,
  (birthdate + INTERVAL '65 years') AS date_turns_65,
  AGE((birthdate + INTERVAL '65 years'), CURRENT_DATE) AS time_until_65
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
ORDER BY date_turns_65
LIMIT 15;
Displaying records 1 - 10
id_text birthdate date_turns_65 time_until_65
Tamb|Bali|1953-05-10 1953-05-10 2018-05-10 -7 years -6 mons -5 days
Tamb|Bali|1953-05-10 1953-05-10 2018-05-10 -7 years -6 mons -5 days
Tamb|Bali|1953-05-10 1953-05-10 2018-05-10 -7 years -6 mons -5 days
Tamb|Bali|1953-05-10 1953-05-10 2018-05-10 -7 years -6 mons -5 days
Home|Quiñ|1955-10-26 1955-10-26 2020-10-26 -5 years -20 days
Home|Quiñ|1955-10-26 1955-10-26 2020-10-26 -5 years -20 days
Home|Quiñ|1955-10-26 1955-10-26 2020-10-26 -5 years -20 days
Home|Quiñ|1955-10-26 1955-10-26 2020-10-26 -5 years -20 days
Mari|Lind|1966-06-23 1966-06-23 2031-06-23 5 years 7 mons 8 days
Mari|Lind|1966-06-23 1966-06-23 2031-06-23 5 years 7 mons 8 days

By adding intervals to birthdate, you can project future milestones, calculate eligibility periods, or align with policy cutoffs (e.g., retirement age).
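The same interval arithmetic works for other cutoffs. As an illustrative sketch (same visit_hospital table; the 18-year milestone is an assumed example), the query below flags whether each person has already reached their 18th birthday:

-- Has each person turned 18 yet?
SELECT
  id_text,
  birthdate,
  (birthdate + INTERVAL '18 years')::date AS date_turns_18,
  (birthdate + INTERVAL '18 years') <= CURRENT_DATE AS is_adult
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
ORDER BY date_turns_18;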

Detect Implausible or Anomalous Ages

Identifying outliers helps ensure that demographic insights are grounded in reality.

WITH ages AS (
  SELECT
    id_text,
    DATE_PART('year', AGE(CURRENT_DATE, birthdate)) AS age_years
  FROM public.visit_hospital
  WHERE birthdate IS NOT NULL
)
SELECT
  COUNT(*) FILTER (WHERE age_years < 0)   AS negative_age,
  COUNT(*) FILTER (WHERE age_years > 100) AS unrealistic_age,
  ROUND(AVG(age_years)::numeric, 1)       AS avg_age
FROM ages;
1 records
negative_age unrealistic_age avg_age
0 0 47.3

This check ensures no impossible ages (negative or over 100 years) are present. Such records may indicate data-entry or unit conversion issues.

Temporal Trends by Birth Year or Decade

Aggregating by birth year or decade reveals how age distributions evolve over time.

-- Group individuals by birth year and decade
SELECT
  FLOOR(EXTRACT(YEAR FROM birthdate) / 10) * 10 AS birth_decade,
  ROUND(CAST(AVG(DATE_PART('year', AGE(CURRENT_DATE, birthdate))) AS NUMERIC), 1) AS avg_age,
  COUNT(*) AS total_individuals
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY birth_decade
ORDER BY birth_decade;
4 records
birth_decade avg_age total_individuals
1950 71 8
1960 59 4
1980 41 4
2000 21 8

EXTRACT(YEAR …) pulls the numeric birth year.

Decade grouping (FLOOR(year / 10) * 10) is helpful for longitudinal or generational reporting.

This provides an age summary across decades — a compact representation of how age structure shifts over time.

Part VII — Descriptive Statistics & Tabulation

  • Aggregations & GROUP BY; HAVING filters
  • Ordered-set aggregates (percentiles)
  • Crosstab/pivot with tablefunc extension

Once the data is clean and structured, the next step is summarization and tabulation — transforming granular observations into interpretable statistics.
This section demonstrates descriptive techniques such as aggregation with GROUP BY, conditional filtering with HAVING, calculating percentiles, and reshaping summaries into cross-tabulated (pivoted) formats.

All examples below use the public.global_health_statistics dataset.


1️⃣ Aggregation with GROUP BY

Aggregations condense large datasets into concise summaries — totals, averages, and counts by meaningful categories. The GROUP BY clause in SQL is the equivalent of group_by() + summarise() in R.

In data analytics, this is how we turn millions of records into tables that inform decision-making.

In the example below, the query groups data by both country and disease, and computes three key metrics:

  • The average number of people affected per record,
  • The total number of people affected (overall disease burden),
  • The number of records contributing to each group.

The example below shows how GROUP BY can be used to generate aggregated insights across multiple dimensions, revealing where disease burdens are highest.

-- Average population affected by disease and country
SELECT
  country,
  disease_name,
  ROUND(AVG(population_affected)::numeric, 0) AS avg_population_affected,
  SUM(population_affected)                     AS total_population_affected,
  COUNT(*)                                     AS records_count
FROM public.global_health_statistics
GROUP BY country, disease_name
ORDER BY total_population_affected DESC
LIMIT 15;
Displaying records 1 - 10
country disease_name avg_population_affected total_population_affected records_count
Germany Tuberculosis 994653 24866314 25
Mexico Ebola 994470 24861757 25
Turkey Dengue 994385 24859631 25
Argentina Ebola 994191 24854765 25
France Hepatitis 994065 24851614 25
Argentina Diabetes 993911 24847772 25
Mexico Rabies 993845 24846123 25
South Korea Parkinson’s Disease 993740 24843491 25
Turkey HIV/AIDS 993719 24842966 25
Italy Polio 993676 24841908 25

Context: This type of aggregation could support a quick ranking of health priorities — for instance, identifying which diseases have historically affected the largest populations within each country. Using both AVG() and SUM() together allows you to distinguish between typical impact per observation and overall cumulative impact. The ORDER BY ensures that the most significant results (in terms of total affected population) appear at the top.

2️⃣ HAVING for Filtering Aggregated Results

After summarizing, it’s often useful to narrow down results to only those groups that meet specific conditions — for example, diseases with exceptionally high case counts. Unlike WHERE, which filters individual rows before aggregation, HAVING filters aggregated groups after summary calculations.

The example below shows how to isolate high-impact diseases affecting over one million people in total.

-- Show only diseases with more than 1 million people affected
SELECT
  disease_name,
  SUM(population_affected) AS total_population
FROM public.global_health_statistics
GROUP BY disease_name
HAVING SUM(population_affected) > 1000000
ORDER BY total_population DESC
LIMIT 10;
Displaying records 1 - 10
disease_name total_population
Polio 495323206
Tuberculosis 495320883
Parkinson’s Disease 495304287
Ebola 495254480
Hypertension 495155070
Cholera 495153078
Alzheimer’s Disease 495140228
Zika 495121139
COVID-19 495111737
Cancer 495100583

Here, HAVING ensures we only report diseases where the cumulative affected population exceeds 1 million. Context: This query is useful when performing threshold-based analysis — such as identifying key health challenges that exceed certain severity levels. It could easily be adapted for different analytical questions, such as finding diseases with a rapid increase over time or regions exceeding target incidence rates.
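To make the WHERE-versus-HAVING distinction concrete, the sketch below (same table; the year cutoff and threshold are illustrative) filters rows first, then filters the aggregated groups:

-- WHERE trims rows before aggregation; HAVING trims groups after it
SELECT
  disease_name,
  SUM(population_affected) AS total_population
FROM public.global_health_statistics
WHERE year_event >= 2010                      -- row-level filter (before grouping)
GROUP BY disease_name
HAVING SUM(population_affected) > 1000000     -- group-level filter (after grouping)
ORDER BY total_population DESC;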

3️⃣ Ordered-Set Aggregates: Percentiles & Medians

While aggregates like AVG() provide a measure of central tendency, percentiles and medians describe distribution and spread. They are especially useful when data contains outliers — for example, one extreme outbreak could distort the average, while the median gives a more typical view.

The example below shows how to compute quartiles (25th, 50th, 75th percentiles) of population affected for each disease.

-- Compute median and quartiles of affected population by disease
SELECT
  disease_name,
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY population_affected) AS q1,
  PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY population_affected) AS median,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY population_affected) AS q3
FROM public.global_health_statistics
WHERE population_affected IS NOT NULL
GROUP BY disease_name
ORDER BY median DESC
LIMIT 10;
Displaying records 1 - 10
disease_name q1 median q3
Zika 986952.5 993871.0 997424.8
COVID-19 986470.8 993712.0 996905.0
Parkinson’s Disease 986463.0 993632.5 997368.5
Cholera 987292.5 993503.5 997038.0
Polio 987037.5 993333.5 996737.0
Measles 985969.0 993319.5 997046.5
Leprosy 985762.0 993277.5 997048.2
Diabetes 987171.0 993161.0 997301.8
Tuberculosis 987507.0 993129.5 997328.0
Ebola 986789.0 993120.5 997368.8

Context: Percentiles help identify which diseases have the widest or most concentrated spread of population impact. For instance, a disease with a large gap between the 25th and 75th percentile likely affects different regions very unevenly — an insight valuable for targeted interventions.
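Building on the quartiles above, the interquartile range (q3 minus q1) condenses that spread into a single number per disease; a sketch:

-- Rank diseases by spread of population impact (IQR = Q3 - Q1)
SELECT
  disease_name,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY population_affected)
    - PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY population_affected) AS iqr
FROM public.global_health_statistics
WHERE population_affected IS NOT NULL
GROUP BY disease_name
ORDER BY iqr DESC
LIMIT 10;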

4️⃣ Reshaping Data: Cross-Tabulation (Pivoting Data)

Data reshaping refers to transforming the structure of a dataset — switching between long and wide formats depending on analytical or reporting needs.
In long format, each observation occupies one row (e.g., each record represents a country–year–disease combination).
In wide format, a single row may represent a country or year, and multiple variables or categories become columns (e.g., diseases side by side).

Reshaping is a key part of preparing data for visualizations, dashboards, and comparative summaries.
While SQL is inherently row-oriented, PostgreSQL’s tablefunc extension and some creative joins make reshaping feasible directly within SQL.

These transformations are common in epidemiology, public health reporting, and data science workflows:

  • Analysts prefer long format for modeling and statistical analysis.
  • Decision-makers and stakeholders often prefer wide format for dashboards, comparisons, and quick scanning.

The following examples demonstrate both wide and long reshaping operations using the global_health_statistics dataset.

4.1: Reshaping to Wide Format

Pivoting data turns long-format tables into wide-format summaries, similar to Excel pivot tables or R’s pivot_wider(). This is particularly useful for comparing values side by side — such as annual totals per disease.

In PostgreSQL, pivoting can be done using the crosstab() function from the tablefunc extension.

Why this matters: Stakeholders often ask for “diseases across columns by year.” The example below pivots totals by year into a wide table for quick scanning.

-- Enable the extension (run once per database)
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT *
FROM crosstab(
  $$
    SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
    FROM public.global_health_statistics
    GROUP BY year_event, disease_name
    ORDER BY year_event, disease_name
  $$
) AS ct(
  year_event INT,
  malaria NUMERIC,
  dengue NUMERIC,
  influenza NUMERIC,
  covid19 NUMERIC
)
LIMIT 10;
Displaying records 1 - 10
year_event malaria dengue influenza covid19
2000 19835840 19806950 19801964 19865735
2001 19794559 19817181 19824040 19822067
2002 19836984 19816082 19759866 19784016
2003 19773534 19704438 19764751 19803744
2004 19792759 19825219 19778123 19787008
2005 19850519 19787102 19870901 19746345
2006 19846542 19769208 19814068 19747205
2007 19847607 19821630 19766503 19760524
2008 19735264 19848897 19823537 19756148
2009 19718827 19763020 19770930 19802159

Context: This example shows how a long transactional dataset can be reshaped into a compact, comparative view where each disease becomes a column. Such output is ideal for visualizing trends or preparing data for dashboards (e.g., year-over-year change in disease prevalence).
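One caveat: the one-argument crosstab() form fills output columns positionally, so the column aliases above only line up if every row group lists the same categories in the same order. The two-argument form, which takes a second query enumerating the categories explicitly, is more robust. A sketch (the four disease names are chosen to match the column aliases):

-- Two-argument crosstab: output columns pinned by the category query
SELECT *
FROM crosstab(
  $$
    SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
    FROM public.global_health_statistics
    WHERE disease_name IN ('Malaria','Dengue','Influenza','COVID-19')
    GROUP BY year_event, disease_name
    ORDER BY year_event
  $$,
  $$ VALUES ('Malaria'), ('Dengue'), ('Influenza'), ('COVID-19') $$
) AS ct(
  year_event INT,
  malaria NUMERIC,
  dengue NUMERIC,
  influenza NUMERIC,
  covid19 NUMERIC
);

With this form, missing category/row combinations come back as NULL instead of silently shifting values into the wrong column.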

4.2: Reshaping to Long Format

Long format is often preferred for modeling, analytics, and machine learning pipelines because it’s tidy — every variable has its own column, and every observation has its own row. When you receive data in a wide structure (for example, one column per disease), converting it back to long format makes it easier to aggregate, filter, or compute metrics by category.

Why this matters: Analytical workflows (e.g., regression, forecasting, clustering) often expect a normalized long structure. This format aligns better with relational database principles and is easier to extend when new diseases or measures appear.

In PostgreSQL, this can be achieved with UNION ALL or the more compact UNNEST() approach.

SELECT year_event, disease_name, population_affected
FROM (
    SELECT *
    FROM crosstab(
      $$
        SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
        FROM public.global_health_statistics
        GROUP BY year_event, disease_name
        ORDER BY year_event, disease_name
      $$
    ) AS ct(
      year_event INT,
      malaria NUMERIC,
      dengue NUMERIC,
      influenza NUMERIC,
      covid19 NUMERIC
    )
) crosstab_result
CROSS JOIN LATERAL unnest(
    ARRAY['malaria','dengue','influenza','covid19'],
    ARRAY[malaria,dengue,influenza,covid19]
) AS t(disease_name, population_affected)
ORDER BY year_event, disease_name;
Displaying records 1 - 10
year_event disease_name population_affected
2000 covid19 19865735
2000 dengue 19806950
2000 influenza 19801964
2000 malaria 19835840
2001 covid19 19822067
2001 dengue 19817181
2001 influenza 19824040
2001 malaria 19794559
2002 covid19 19784016
2002 dengue 19816082

  • The first ARRAY[…] defines the disease names.

  • The second ARRAY[…] holds the corresponding values from each column.

  • UNNEST expands the two arrays pairwise into long rows.

Another way to achieve the same result is UNION ALL, which manually “unpivots” wide data into long form:

-- Example: manually reshaping a wide table back to long format
-- (assuming we already have a pivoted summary table)
WITH disease_wide AS (
  SELECT *
  FROM crosstab(
    $$
      SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
      FROM public.global_health_statistics
      GROUP BY year_event, disease_name
      ORDER BY year_event, disease_name
    $$
  ) AS ct(
    year_event INT,
    malaria NUMERIC,
    dengue NUMERIC,
    influenza NUMERIC,
    covid19 NUMERIC
  )
)
SELECT year_event, 'malaria'   AS disease_name, malaria   AS population_affected FROM disease_wide
UNION ALL
SELECT year_event, 'dengue',     dengue   FROM disease_wide
UNION ALL
SELECT year_event, 'influenza',  influenza FROM disease_wide
UNION ALL
SELECT year_event, 'covid19',    covid19   FROM disease_wide
ORDER BY year_event, disease_name
LIMIT 20;
Displaying records 1 - 10
year_event disease_name population_affected
2000 covid19 19865735
2000 dengue 19806950
2000 influenza 19801964
2000 malaria 19835840
2001 covid19 19822067
2001 dengue 19817181
2001 influenza 19824040
2001 malaria 19794559
2002 covid19 19784016
2002 dengue 19816082

5️⃣ Multi-Level Summaries

When you want to track patterns over time, it’s helpful to summarize data at multiple hierarchical levels. The query below aggregates the total population affected by year and country, producing a panel-like view suitable for temporal trend analysis.

-- Yearly total affected population per country
SELECT
  country,
  year_event,
  SUM(population_affected) AS total_affected
FROM public.global_health_statistics
GROUP BY country, year_event
ORDER BY country, year_event;
Displaying records 1 - 10
country year_event total_affected
Argentina 2000 19817875
Argentina 2001 19835072
Argentina 2002 19839406
Argentina 2003 19825531
Argentina 2004 19868546
Argentina 2005 19803811
Argentina 2006 19873224
Argentina 2007 19812365
Argentina 2008 19796552
Argentina 2009 19806823

Context: This example demonstrates how a simple GROUP BY over multiple dimensions can reveal longitudinal trends — such as whether a country’s total burden is increasing or decreasing over time. The resulting dataset could serve as the basis for line charts or heatmaps in further visualization steps.


🔎 Summary Table: Descriptive Statistics & Reshaping Techniques

Technique Purpose SQL Example When to Use
GROUP BY + aggregates (SUM, AVG, COUNT) Collapse raw rows into grouped summaries SELECT country, disease_name, SUM(population_affected) FROM ... GROUP BY country, disease_name; Rank burdens by country/disease, create quick summaries
HAVING filters Filter groups after aggregation ... GROUP BY disease_name HAVING SUM(population_affected) > 1000000; Identify only high-impact diseases or threshold exceedances
Ordered-set aggregates (PERCENTILE_CONT) Compute medians, quartiles, or percentiles SELECT disease_name, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY population_affected) FROM ... GROUP BY disease_name; Understand distributions, reduce impact of outliers
crosstab() (wide pivot) Reshape long data into wide format (like Excel pivot tables) SELECT * FROM crosstab($$SELECT year_event, disease_name, SUM(population_affected) ...$$) AS ct(...); Dashboards, year-over-year comparisons, stakeholder summaries
UNNEST() / UNION ALL (long unpivot) Convert wide back to long/tidy format SELECT year_event, disease_name, population_affected FROM wide CROSS JOIN LATERAL unnest(...); Feed into modeling, visualization, machine learning
Multi-level summaries Aggregate across multiple hierarchical dimensions SELECT country, year_event, SUM(population_affected) FROM ... GROUP BY country, year_event; Trend analysis across time and geography

Key Insight:
- Wide format is great for comparative dashboards, stakeholder reporting, and scanning.
- Long format is better for tidy analysis, visualisation, and interoperability with tools like R or Python.
- SQL’s flexibility (GROUP BY, HAVING, PERCENTILE_CONT, crosstab, UNNEST) allows analysts to move fluidly between exploratory statistics and presentation-ready summaries.


Key Takeaways

Descriptive SQL transforms raw transactional data into interpretable insights through summarization and restructuring.

The ability to move fluidly between long and wide formats reflects practical analytical flexibility — catering to both technical and decision-maker audiences.

Part VIII — Performance Basics

Database performance tuning is an essential skill when working with large datasets such as public.global_health_statistics. Even seemingly simple queries can become very slow if the database needs to scan millions of rows. This section demonstrates three core performance techniques — indexing, reading query plans, and simple tuning habits — using examples from the dataset. Performance optimization in SQL typically involves:

  • Creating the right indexes
  • Understanding how queries are executed with EXPLAIN ANALYZE
  • Applying simple tuning strategies to improve speed

1. Indexing Strategies

SQL databases use B-tree indexes by default. They work like a sorted lookup structure that makes finding rows faster.

When to use indexes:

  • On filtering columns (WHERE)
  • On grouping keys (GROUP BY)
  • On ordering columns (ORDER BY)
  • On join keys between tables

Indexes act like a book’s table of contents: instead of flipping through every page, the database can jump directly to the rows it needs. For instance:

-- Without an index: PostgreSQL will likely do a sequential scan
EXPLAIN ANALYZE
SELECT *
FROM public.global_health_statistics
WHERE country = 'Australia';
7 records
QUERY PLAN
Bitmap Heap Scan on global_health_statistics (cost=8.16..1796.94 rows=500 width=168) (actual time=0.143..0.442 rows=500 loops=1)
Recheck Cond: (country = 'Australia'::text)
Heap Blocks: exact=497
-> Bitmap Index Scan on idx_ghs_country (cost=0.00..8.04 rows=500 width=0) (actual time=0.091..0.091 rows=500 loops=1)
Index Cond: (country = 'Australia'::text)
Planning Time: 0.077 ms
Execution Time: 0.467 ms

Indexes speed up queries by allowing the database to look up rows efficiently rather than scanning the entire table. (The plan above already shows a Bitmap Index Scan because idx_ghs_country existed when this page was rendered; without it, PostgreSQL would fall back to a sequential scan.)

CREATE INDEX IF NOT EXISTS idx_ghs_country ON public.global_health_statistics (country);

-- Re-run the query
EXPLAIN ANALYZE
SELECT *
FROM public.global_health_statistics
WHERE country = 'Australia';

With the index in place, the plan switches to an Index Scan, reducing I/O dramatically. On large datasets this difference can amount to seconds or minutes saved; in this small example, execution time dropped from about 2.9 to 0.88 milliseconds (see the screenshot below).

The following screenshot shows the execution time difference observed before and after adding an index:

PostgreSQL query plan comparison (screenshot)

2. Reading Query Plans with EXPLAIN ANALYZE

Indexes are only useful if the database actually uses them. That’s where EXPLAIN ANALYZE comes in. It shows the chosen plan and the actual execution time.

-- Create index on year_event
CREATE INDEX IF NOT EXISTS idx_ghs_year_event ON public.global_health_statistics (year_event);

-- Re-run the query to see the faster plan
EXPLAIN ANALYZE
SELECT year_event, COUNT(*) AS n_records
FROM public.global_health_statistics
GROUP BY year_event
ORDER BY year_event;

The output explains whether PostgreSQL used sorting, hashing, or index scans. If grouping by year is frequent, adding an index on year may shift the plan from expensive full sorts to more efficient operations.
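When a query filters on one column and groups or sorts on another, a single multicolumn index can serve both steps. A sketch (the index name is illustrative):

-- Composite index supporting WHERE country = ... GROUP BY year_event
CREATE INDEX IF NOT EXISTS idx_ghs_country_year
  ON public.global_health_statistics (country, year_event);

EXPLAIN ANALYZE
SELECT year_event, COUNT(*) AS n_records
FROM public.global_health_statistics
WHERE country = 'Australia'
GROUP BY year_event
ORDER BY year_event;

Column order matters: the equality-filtered column (country) goes first so the index can both narrow the scan and return year_event values already in order.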

3. Simple Tuning Tips

Even without adding new indexes, developers can make queries run faster by following small habits:

a) Predicates: Make Them SARGable

SARGable = Search ARGument-able: a predicate the planner can match directly against an index. Avoid wrapping columns in functions.

❌ Non-SARGable (illustrative)

WHERE ROUND(mortality_pct::numeric, 1) = 5.0
WHERE year_event::text = '2020'

✅ SARGable

WHERE mortality_pct = 5.0
WHERE year_event = 2020

Wrapping the column in a function or cast hides it from any index on that column; comparing the bare column to a constant keeps the predicate index-friendly.
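When a function-wrapped predicate is genuinely unavoidable, PostgreSQL also supports expression indexes, which index the computed value so the wrapped predicate becomes indexable again. A sketch (the index name and rounding expression are illustrative):

-- Expression index: indexes the computed value itself
CREATE INDEX IF NOT EXISTS idx_ghs_mortality_rounded
  ON public.global_health_statistics (ROUND(mortality_pct::numeric, 0));

-- This predicate can now use the index
-- (the expression must match the indexed one exactly)
SELECT country, disease_name, mortality_pct
FROM public.global_health_statistics
WHERE ROUND(mortality_pct::numeric, 0) = 5;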

b) Projections: Select Only What You Need

-- Avoid selecting every column:
-- SELECT * FROM public.global_health_statistics;

SELECT country, year_event, disease_name, mortality_pct
FROM public.global_health_statistics
LIMIT 20;
Displaying records 1 - 10
country year_event disease_name mortality_pct
Nigeria 2013 Asthma 9.15
Turkey 2010 Tuberculosis 2.75
Argentina 2006 Ebola 3.16
Saudi Arabia 2013 Hypertension 5.57
Nigeria 2004 COVID-19 8.77
Russia 2012 Parkinson’s Disease 6.85
Saudi Arabia 2009 Cancer 1.58
Russia 2009 Ebola 4.46
Nigeria 2022 Polio 1.13
India 2016 Leprosy 8.48

c) Ordering: Use Index Support

If you frequently order by columns like population_affected, consider an index.

CREATE INDEX IF NOT EXISTS idx_population_affected ON public.global_health_statistics (population_affected);

SELECT country, population_affected
FROM public.global_health_statistics
ORDER BY population_affected DESC
LIMIT 15;

Conclusion

With even these basic optimizations, you can significantly improve query performance:

Indexing frequently used filter, grouping, and ordering columns (year_event, country, disease_name, population_affected).

Reading query plans with EXPLAIN ANALYZE to understand execution.

Tuning queries by making predicates SARGable, projecting only needed columns, and leveraging indexes for ordering.

These practices ensure your queries on global_health_statistics remain efficient even as the dataset grows.

Part IX — Security, Backup & Sharing

Ensuring data integrity and accessibility means thinking beyond querying. This section covers how to:

  • Secure access with roles & privileges
  • Create backups and restore databases
  • Export and share query results for reporting

1. Roles & Privileges — Granting Read-Only Access to Learners

PostgreSQL supports fine-grained control over who can connect to databases and what they can do inside.

Give Learners Read-Only Access to Global Health Data

-- Identify the database name
SELECT datname FROM pg_database;

A. Create the Role for the Learner

DO $$
BEGIN
  IF NOT EXISTS (
    SELECT FROM pg_roles WHERE rolname = 'learner_user'
  ) THEN
    CREATE ROLE learner_user WITH LOGIN PASSWORD 'SafePass123';
  END IF;
END $$;

B. Grant Database Access

-- Allow learner_user to connect to a specific database (e.g., SQL_Data_Manipulation)
GRANT CONNECT ON DATABASE "SQL_Data_Manipulation" TO learner_user;

C. Grant Schema-Level Access

-- Allow the user to access objects in the 'public' schema
GRANT USAGE ON SCHEMA public TO learner_user;

D. Grant Read Access to Specific Tables

-- Allow SELECT queries only
GRANT SELECT ON public.global_health_statistics TO learner_user;

Now, learner_user can log in, browse the schema, and query the dataset — but cannot modify anything.
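Two follow-up grants are often useful: one covering every existing table in the schema, and one covering tables created later via ALTER DEFAULT PRIVILEGES (which, note, applies only to objects created by the role that runs it). A sketch:

-- Read access to all current tables in the public schema
GRANT SELECT ON ALL TABLES IN SCHEMA public TO learner_user;

-- Read access to tables created in public in the future
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO learner_user;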

Create Multiple Learners with a Group Role

Instead of managing access per individual, create a shared group role and assign learners to it.

-- Create the role group
DO $$
BEGIN
  IF NOT EXISTS (
    SELECT FROM pg_roles WHERE rolname = 'learners_group'
  ) THEN
    CREATE ROLE learners_group;
  END IF;
END $$;
-- Create individual users
DO $$
BEGIN
  IF NOT EXISTS (
    SELECT FROM pg_roles WHERE rolname = 'learner_ali'
  ) THEN
    CREATE ROLE learner_ali WITH LOGIN PASSWORD 'Ali123';
  END IF;
END $$;
DO $$
BEGIN
  IF NOT EXISTS (
    SELECT FROM pg_roles WHERE rolname = 'learner_sara'
  ) THEN
    CREATE ROLE learner_sara WITH LOGIN PASSWORD 'Sara123';
  END IF;
END $$;
-- Add both learners to the group
GRANT learners_group TO learner_ali;
GRANT learners_group TO learner_sara;
-- Grant group access to the database
GRANT CONNECT ON DATABASE "SQL_Data_Manipulation" TO learners_group;
GRANT USAGE ON SCHEMA public TO learners_group;
GRANT SELECT ON public.global_health_statistics TO learners_group;

Audit Who Has Access

SELECT rolname
FROM pg_roles
WHERE has_database_privilege(rolname, 'SQL_Data_Manipulation', 'CONNECT');
Displaying records 1 - 10
rolname
pg_database_owner
pg_read_all_data
pg_write_all_data
pg_monitor
pg_read_all_settings
pg_read_all_stats
pg_stat_scan_tables
pg_read_server_files
pg_write_server_files
pg_execute_server_program

To see which roles have SELECT access on global_health_statistics:

SELECT grantee, privilege_type
FROM information_schema.role_table_grants
WHERE table_name = 'global_health_statistics';
9 records
grantee privilege_type
postgres TRIGGER
postgres REFERENCES
postgres TRUNCATE
postgres DELETE
postgres UPDATE
postgres SELECT
postgres INSERT
learner_user SELECT
learners_group SELECT

2. Backups and Restores

🔸 Using pgAdmin (GUI)

To Backup:

  1. Right-click on SQL_Data_Manipulation

  2. Click Backup

  3. Choose format: Custom

  4. Save the file (.backup)

To Restore:

  1. Create a new empty database (e.g., SQL_Health_Restore)

  2. Right-click → Restore

  3. Load the .backup file
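The same backup and restore can be scripted from the command line with pg_dump and pg_restore (a sketch; the host, user, and file names are illustrative):

# Backup in custom format (matches pgAdmin's "Custom" option)
pg_dump -U postgres -h localhost -Fc -f sql_data_manipulation.backup "SQL_Data_Manipulation"

# Restore into a freshly created empty database
createdb -U postgres -h localhost SQL_Health_Restore
pg_restore -U postgres -h localhost -d SQL_Health_Restore sql_data_manipulation.backup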

3. Exporting Results and Reporting with Quarto

A. Export to CSV

-- Export to CSV (COPY writes on the server; in psql, use \copy to write to the client machine)
COPY (
  SELECT disease_name, year_event, AVG(mortality_pct) AS avg_mortality
  FROM public.global_health_statistics
  GROUP BY disease_name, year_event
) TO 'C:\temp\disease_mortality.csv' CSV HEADER;

B. Render a Quarto Table or Chart

SELECT 
  disease_name,
  ROUND(AVG(mortality_pct)::numeric, 2) AS avg_mortality
FROM public.global_health_statistics
GROUP BY disease_name
ORDER BY avg_mortality DESC
LIMIT 10;
Displaying records 1 - 10
disease_name avg_mortality
Malaria 5.20
Ebola 5.09
Cholera 5.08
Tuberculosis 5.06
Parkinson’s Disease 5.06
COVID-19 5.05
Hepatitis 5.05
Diabetes 5.04
Influenza 5.03
Measles 5.02
df <- read_csv("C:/temp/disease_mortality.csv")
Rows: 500 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): disease_name
dbl (2): year_event, avg_mortality

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(df, aes(x = reorder(disease_name, avg_mortality), y = avg_mortality)) +
geom_col(fill = "#0077cc") +
coord_flip() +
labs(title = "Top 10 Diseases by Average Mortality",
x = "Disease", y = "Mortality %")

Database Administration & Reporting Tasks

Task Tool Notes
Grant DB Access SQL GRANT CONNECT, GRANT SELECT
Bulk Learner Setup SQL + Group Role Use learners_group
Backup pgAdmin / pg_dump Choose Custom format
Restore pgAdmin / pg_restore Create empty DB first
Export \COPY in psql Save results as CSV
Reporting Quarto + ggplot Use dynamic SQL chunks

By combining PostgreSQL’s security model, backup features, and Quarto’s reporting power, you’re equipped to build a secure, scalable, and sharable data analysis environment.

📝 Summary and Conclusion

This presentation provided an end-to-end journey through PostgreSQL on Windows, covering everything from installation to advanced analytics and reporting.

We began with the foundations of SQL (ACID principles, PostgreSQL ecosystem, pgAdmin/psql), then demonstrated data loading and exploration. From there, we explored both core wrangling (SELECT, WHERE, JOIN, CASE) and advanced techniques (CTEs, window functions, deduplication, upserts).

The middle sections focused on descriptive statistics (aggregation, percentiles, crosstabs) and data quality (missing data handling, casting/cleaning, working with dates).
Next, we examined performance basics (indexing, query plans, tuning practices) to ensure queries run efficiently even on large health datasets.

Finally, we addressed the critical topics of security, backup, and sharing, demonstrating how to manage roles and privileges, perform reliable backups, and integrate results with Quarto for transparent reporting.


🎯 Key Takeaways

  • SQL is more than queries — it is a full environment for storing, securing, analyzing, and sharing data.
  • PostgreSQL + pgAdmin provide a robust platform for analytics work.
  • Integration with R and Quarto allows reproducible, professional-grade reporting.
  • Data governance practices (indexes, roles, backups) are as important as analysis techniques.

By combining technical SQL skills with performance, governance, and reproducibility, analysts can deliver results that are accurate, efficient, and trusted.


📚 References & Learning Resources

For further study and practice, here are recommended books and websites:

Books

  • Beaulieu, A. (2020). Learning SQL: Generate, manipulate, and retrieve data (3rd ed.). O’Reilly Media. ISBN: 978-1492057611

  • Obe, R. O., & Hsu, L. S. (2017). PostgreSQL: Up and running: A practical guide to the advanced open source database (3rd ed.). O’Reilly Media. ISBN: 978-1491963418

  • DeBarros, A. (2018). Practical SQL: A beginner’s guide to storytelling with data. No Starch Press. ISBN: 978-1593278274

Websites & Documentation

  • PostgreSQL Official Documentation
  • Postgres Tutorial
  • Mode Analytics SQL Tutorial
  • RStudio Quarto Documentation