# DBI connection created once and reused by SQL chunks
library(DBI)
library(RPostgres)
con <- dbConnect(
RPostgres::Postgres(),
dbname = "SQL_Data_Manipulation",
host = "localhost", port = 5432,
user = "postgres", password = Sys.getenv("PGPASSWORD")
)

🧰 Objective and Contents of the Presentation
This presentation provides a focused, end-to-end path for working with PostgreSQL on Windows—from installation to analysis and reporting. It begins by installing PostgreSQL with the EnterpriseDB bundle, registering the server in pgAdmin, and using StackBuilder for optional tools. It then shows how to create a new database, import data (e.g., .csv), and perform core SQL data manipulation (filtering, grouping, joins, window functions). Finally, it demonstrates how to connect SQL with R via Quarto/R Markdown to generate reproducible reports in HTML/PDF that combine narrative, code, and results.
🧭 Presentation Roadmap
The flow is intentional: begin with foundations (Part I), connect and publish reproducible outputs (Part II), then load and explore data (Part III). After that, move through core and advanced wrangling (Parts IV–V), data quality and types (Part VI), descriptive statistics (Part VII), performance tuning (Part VIII), and finally security, backup, and sharing (Part IX).
Part I — Introduction and Installing SQL
- What is SQL & why it matters (ACID, integrity, performance)
- PostgreSQL overview & tooling (pgAdmin, psql)
Part II — Connecting with R, R Markdown, Quarto
- Connecting from R/Quarto; running and saving scripts; comments; shortcuts
Part III — Data Loading & Exploration
- Table creation (DDL) & data types
- CSV import: `COPY`/`\copy` and permissions
- Basic exploration: previews, counts, distincts, NULL checks
Part IV — Wrangling Core (SELECT / WHERE / CASE / JOIN)
- Selecting & aliasing; computed columns
- Filtering, sorting, limiting
- Conditional logic with `CASE`, `COALESCE`, `NULLIF`
- Joins (INNER/LEFT/RIGHT/FULL) and practical patterns
- Appending data
Part V — Advanced Wrangling
- CTEs (`WITH`) for readable pipelines
- Window functions (ranking, partitions, moving calculations)
- Deduplication with `ROW_NUMBER()`; upserts overview
Part VI — Data Quality, Types & Dates
- Missing data strategies (fill, filter, profile)
- Casting & cleaning (regex, numeric/text conversions)
- Date/time basics and analysis
Part VII — Descriptive Statistics & Tabulation
- Aggregations & `GROUP BY`; `HAVING` filters
- Ordered-set aggregates (percentiles)
- Crosstab/pivot with the `tablefunc` extension
Part VIII — Performance Basics
- Indexing strategies (B-tree on filters/group keys)
- Reading query plans with `EXPLAIN ANALYZE`
- Simple tuning tips (predicates, projections, ordering)
Part IX — Security, Backup & Sharing
- Roles & privileges (grant read-only to learners)
- Backups: pgAdmin / `pg_dump` & restores
- Exporting results (CSV/HTML) & reporting with Quarto
Part I — Introduction: Installing, RDBMS, operations and toolkit
This section sets the context: what SQL is, when a relational database is the right tool, and how ACID properties (Atomicity: all-or-nothing transactions; Consistency: always obeys constraints; Isolation: concurrent work behaves as if run alone; Durability: commits persist even after failures) and well-designed schemas support reliable analytics at scale. It briefly compares mainstream RDBMS options. Then I walk through a clean Windows install using the EnterpriseDB bundle, registering the server in pgAdmin and verifying connectivity before any data work. In short, the hands-on toolkit centers on SELECT … FROM with WHERE filters, ORDER BY/LIMIT shaping, and DISTINCT/aliases; aggregations (COUNT/SUM/AVG) with GROUP BY/HAVING; joins to combine tables; and a few essentials—COALESCE/CASE, type casts (::), and common string/date helpers.
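To make that toolkit concrete before the PostgreSQL walkthrough, here is a minimal, self-contained sketch using Python's built-in `sqlite3` module (the table name and rows are invented for illustration); the same SELECT/WHERE/GROUP BY/HAVING pattern runs unchanged in PostgreSQL.

```python
import sqlite3

# Tiny in-memory stand-in for the health-statistics table (illustrative only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (country TEXT, year_event INT, dalys REAL)")
con.executemany(
    "INSERT INTO ghs VALUES (?, ?, ?)",
    [("Australia", 2019, 120.0), ("Australia", 2021, 80.0),
     ("France", 2021, None), ("France", 2022, 50.0)],
)

# The core toolkit in one query: WHERE filtering, COALESCE for NULLs,
# GROUP BY aggregation, HAVING on the aggregate, ORDER BY shaping.
rows = con.execute("""
    SELECT country, SUM(COALESCE(dalys, 0)) AS total_dalys
    FROM ghs
    WHERE year_event >= 2019
    GROUP BY country
    HAVING SUM(COALESCE(dalys, 0)) > 0
    ORDER BY total_dalys DESC
""").fetchall()
print(rows)  # [('Australia', 200.0), ('France', 50.0)]
```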
🧭 Relational Database Management Systems (RDBMS) Landscape
Relational Database Management Systems (RDBMS) are the engines that store and query structured, table-shaped data using SQL. They matter because they enforce data integrity (ACID transactions), power reporting and analytics, and keep apps reliable under load. Your choice of RDBMS impacts performance, features you can lean on, skills availability in the team, and ongoing licensing/hosting costs.
- PostgreSQL — Open-source, strong data types, constraints, JSON, CTEs (Common Table Expression), window functions, and extensions (e.g., PostGIS). Excellent for analytics + app backends.
- MySQL / MariaDB — Common in web stacks and hosting; straightforward and lightweight. Great for websites and SaaS backends.
- Microsoft SQL Server — Popular in enterprise settings (Azure/Windows shops). Tight integration with Power BI and the MS ecosystem.
- Oracle Database — Feature-rich enterprise workhorse used by government/finance; heavy on licensing and administration.
- Managed Cloud (DBaaS) — AWS RDS / Azure / Google Cloud SQL hosting PostgreSQL/MySQL/SQL Server with patching, backups, and scaling handled for you.
This project's choice — PostgreSQL:
I used PostgreSQL for this project because it offers modern SQL features without licence fees, runs well on Windows and all major clouds, and scales from local development to production. It has great community support and extensions when you need them.
🔗 Downloading and Installing SQL
Download the PostgreSQL installer from:
👉 https://www.enterprisedb.com/downloads/postgres-postgresql-downloads
Select the Windows version (e.g., x86-64). This installer is bundled with:
- PostgreSQL Server
- pgAdmin
- StackBuilder
- Command-line tools like `psql`
Bundling everything in one installer saves setup headaches later on.
📥 Installation Steps (Windows)
1. Run the Installer
   Double-click the .exe file to launch the setup wizard.
2. Select Components
   Leave all defaults selected:
   - ✅ PostgreSQL Server
   - ✅ pgAdmin
   - ✅ Command Line Tools
   - 🔄 StackBuilder (highly recommended)
3. Choose Installation Directory
   Stick with `C:\Program Files\PostgreSQL\<version>` unless you have a good reason to move it.
4. Set a Password for postgres
   This password secures the default admin account.
   💡 Tip: store it safely — you’ll need it to log into pgAdmin later.
5. Keep the Default Port
   Port 5432 is standard for PostgreSQL. Change it only if you know you have a conflict.
6. Accept the Locale Settings
   The default system locale works fine for most projects. You can later create databases with a specific locale if required.
7. Finish the Installation
   After confirming settings, the installer will complete the setup and initialise the database service.
First Launch: Setting Up pgAdmin
Launch pgAdmin from the Start Menu.
Click “Add New Server”
You’ll be prompted to register your PostgreSQL server so pgAdmin can manage it.
Configure the Connection
- General tab:
  - Name: Local PostgreSQL
- Connection tab:
  - Host: localhost
  - Port: 5432
  - Username: postgres
  - Password: (your password from earlier)
  - ✅ Check Save Password so you’re not prompted every time.
Click Save
You’ll now see your server listed in the left pane. Expand it to view databases, schemas, tables, and more.
- If you see two entries (e.g., Local PostgreSQL and PostgreSQL 17), they’re just pgAdmin shortcuts. Keep the auto-detected one and feel free to delete the duplicate one (e.g., localhost).
- You’ll reconnect each time you reopen pgAdmin (right-click server → Connect). If the password isn’t remembered, re-enter it and tick Save Password.
Verifying the Connection
- Right-click the server → Connect
- Expand Databases > postgres > Schemas > public
- This confirms your PostgreSQL server is active and accessible.
You’re now ready to manage your databases through the GUI.
🔌 Using StackBuilder to Install Extensions
- Select your PostgreSQL instance (e.g., PostgreSQL 15 on port 5432)
- Click Next
- You’ll see several categories. Here’s a guide:
What is recommended:
| Category | Description |
|---|---|
| Add-ons, tools and utilities | Tools like pgAgent, pgBackRest, or adminpack — helpful for backups, scheduling, and server-side utilities. |
| Database Drivers | ODBC, JDBC, and .NET drivers — install if you’re planning to connect PostgreSQL from BI tools or other languages like Java or R. |
Optional:
| Category | Use only if… |
|---|---|
| Spatial Extensions | You need GIS support — this includes PostGIS. |
| Web Development | You’re working on a full PHP/PostgreSQL stack (less common). |
| Database Server | Already installed via initial setup — no need to repeat. |
| Registration-required | Mostly trialware; skip unless testing something specific. |
Once selected, follow the prompts to download and install the tools/extensions you need.
- Reconnect behaviour: Each time you reopen pgAdmin, right-click the server → Connect (password won’t be prompted if saved).
If you see `fe_sendauth: no password supplied`, open the server properties → Connection tab → re-enter the password and tick Save Password if required.
SQL Essential Functions and Operations — Overview
This section offers an overview of essential SQL functions and operations, focusing on their use in data manipulation. We’ll explore core concepts that allow you to retrieve, filter, sort, and aggregate data, which are fundamental to working with any dataset. Later, these principles will be demonstrated using the public.global_health_statistics dataset within the SQL_Data_Manipulation database.
- Querying Basics: Use `SELECT ... FROM ...` to retrieve columns and rows. Add `WHERE` to filter, `ORDER BY` to sort, `LIMIT` to cap rows, and `DISTINCT` to remove duplicates. Column and table aliases keep queries readable.
- Operators & Predicates: Core building blocks for conditions: comparison (`=`, `<>`, `<`, `<=`, `>`, `>=`), logical (`AND`, `OR`, `NOT`), membership (`IN`), ranges (`BETWEEN`), pattern match (`LIKE`, `ILIKE`), null checks (`IS NULL`, `IS NOT NULL`), and regex (`~`, `~*`) in PostgreSQL.
- Identifiers, Aliases, Schemas: Qualify objects as `schema.table` (e.g., `public.global_health_statistics`). Use `AS` for aliases (e.g., `ghs`). Strings use single quotes `'...'`; double quotes preserve case/characters in identifiers; end statements with `;`. Comments: `-- line` and `/* block */`.
- Aggregation & Grouping: Aggregate with `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`. Use `GROUP BY` to define groups and `HAVING` to filter aggregated results.
- Joining Data: Combine tables on keys. Typical joins: INNER (matches), LEFT (preserve left), RIGHT, FULL, and CROSS. Prefer explicit `JOIN ... ON ...` with clear key conditions.
- Subqueries & CTEs: Nest queries inside `WHERE`/`FROM` or use CTEs with `WITH`. CTEs improve readability and can be chained; PostgreSQL also supports `WITH RECURSIVE` for hierarchical problems.
- Set Operations: Stack compatible result sets vertically: `UNION` (dedup), `UNION ALL` (no dedup), `INTERSECT` (overlap), `EXCEPT` (left minus right).
- Window Functions: Compute ranks, running totals, and moving averages without collapsing rows: `... OVER (PARTITION BY ... ORDER BY ...)`. Common helpers: `ROW_NUMBER`, `RANK`, `LAG`, `LEAD`, `SUM`/`AVG OVER`.
- Strings, Numbers, Dates: Strings: `UPPER`, `LOWER`, `TRIM`, `LENGTH`, `REPLACE`, `SUBSTRING`, concatenation `||`. Numbers: `ROUND`, `CEIL`, `FLOOR`, safe division with `NULLIF`. Dates: `NOW`, `EXTRACT`, `DATE_TRUNC`, `MAKE_DATE`. Cast types with `::` or `CAST(...)`.
- Conditional & NULL Handling: `CASE` builds conditional expressions; `COALESCE` picks the first non-NULL value; `NULLIF(a,b)` returns `NULL` if `a = b` (useful for safe division).
- DDL (Data Definition Language), Structures & Indexes: Define and evolve schema: `CREATE`/`ALTER`/`DROP TABLE`, constraints (`PRIMARY KEY`, `UNIQUE`, `NOT NULL`, `CHECK`), and `CREATE INDEX` to speed up common filters/joins.
- Transactions & Security: Wrap changes in `BEGIN ... COMMIT` (or `ROLLBACK`). Use `SAVEPOINT` for partial rollbacks. Control access with `GRANT`/`REVOKE`.
- Loading & Export: Bulk load/export CSV using `COPY` (server-side) or `\copy` (psql client-side). On Windows, prefer forward slashes (`C:/...`), and if permissions fail, move files to `C:/temp/`.
- Introspection & “Help”: Explore metadata via `information_schema` (tables/columns) or psql meta-commands (`\dt`, `\d ghs`, `\dn`, `\l`). `SHOW search_path;` reveals the active schema order.
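Two of these building blocks, CTEs and window functions, are easiest to grasp in miniature. A hedged sketch (invented mini-table; Python's built-in `sqlite3` standing in for a real server, since SQLite also supports `WITH` and `OVER`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (country TEXT, year_event INT, incidence REAL)")
con.executemany("INSERT INTO ghs VALUES (?, ?, ?)", [
    ("Australia", 2020, 2.0), ("Australia", 2021, 3.0),
    ("France", 2020, 5.0), ("France", 2021, 4.0),
])

# A CTE names an intermediate result; LAG compares each year to the prior one
# within each country's partition.
rows = con.execute("""
    WITH recent AS (
        SELECT country, year_event, incidence
        FROM ghs
        WHERE year_event >= 2020
    )
    SELECT country, year_event,
           incidence - LAG(incidence) OVER (
               PARTITION BY country ORDER BY year_event
           ) AS yoy_change
    FROM recent
    ORDER BY country, year_event
""").fetchall()
print(rows)
# [('Australia', 2020, None), ('Australia', 2021, 1.0),
#  ('France', 2020, None), ('France', 2021, -1.0)]
```

The first row of each partition has no prior row, so `LAG` yields NULL there.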
📋 Summary Table — Operators, Functions & Patterns
Abbreviation used below: `ghs` = `public.global_health_statistics`
| Operator / Function / Pattern | What it does | Tiny example (PostgreSQL) |
|---|---|---|
| `SELECT ... FROM` | Retrieve columns/rows | `SELECT country, year_event FROM ghs;` |
| `WHERE` | Filter rows | `... FROM ghs WHERE year_event >= 2018;` |
| `ORDER BY` | Sort output | `... ORDER BY incidence DESC;` |
| `LIMIT` / `OFFSET` | Cap rows / paginate | `... ORDER BY dalys DESC LIMIT 10;` |
| `DISTINCT` | Remove duplicates | `SELECT DISTINCT disease_category FROM ghs;` |
| `AS` (aliases) | Rename columns/tables | `SELECT dalys AS total_dalys FROM ghs;` |
| Comparison ops | `=, <>, <, <=, >, >=` | `WHERE country = 'Australia';` |
| Logical ops | `AND, OR, NOT` | `WHERE year_event >= 2015 AND dalys > 0;` |
| Membership | `IN (...)` | `WHERE country IN ('Australia','USA');` |
| Range | `BETWEEN a AND b` | `WHERE year_event BETWEEN 2015 AND 2024;` |
| Pattern match | `LIKE`, `ILIKE` | `WHERE disease_name ILIKE 'influenza%';` |
| Regex (PG) | `~`, `~*` | `WHERE disease_name ~* 'flu\|covid';` |
| Null checks | `IS NULL`, `IS NOT NULL` | `WHERE incidence IS NULL;` |
| String concat | `\|\|` | `SELECT country \|\| ' - ' \|\| disease_name FROM ghs;` |
| Casting | `::type` or `CAST()` | `SELECT incidence::numeric FROM ghs;` |
| Aggregates | `COUNT`/`SUM`/`AVG`/`MIN`/`MAX` | `SELECT SUM(dalys) FROM ghs;` |
| `GROUP BY` | Define groups | `... GROUP BY country, year_event;` |
| `HAVING` | Filter groups | `... GROUP BY country HAVING SUM(dalys) > 0;` |
| JOIN (all) | Combine tables: INNER/LEFT/RIGHT/FULL/CROSS | `SELECT g.country, r.region FROM ghs g LEFT JOIN country_regions r ON r.country = g.country;` |
| Subquery (IN) | Filter by another query | `WHERE country IN (SELECT country FROM focus_list);` |
| CTE (`WITH`) | Name intermediate result | `WITH recent AS (SELECT * FROM ghs WHERE year_event >= 2020) SELECT * FROM recent;` |
| Set ops | `UNION`, `UNION ALL`, `INTERSECT`, `EXCEPT` | `(SELECT country FROM ghs WHERE dalys > 0) INTERSECT (SELECT country FROM ghs WHERE incidence > 0);` |
| Window `OVER(...)` | Rank/totals per partition | `SUM(dalys) OVER (PARTITION BY country);` |
| Ranking | `ROW_NUMBER`, `RANK` | `RANK() OVER (PARTITION BY country ORDER BY SUM(dalys) DESC);` |
| `LAG`/`LEAD` | Prior/next row value | `LEAD(incidence) OVER (PARTITION BY country ORDER BY year_event);` |
| `CASE` | Conditional logic | `CASE WHEN mortality_pct > 5 THEN 'High' ELSE 'Low' END;` |
| `COALESCE` / `NULLIF` | Null handling & safe division | `COALESCE(incidence,0) / NULLIF(population_affected,0);` |
| Strings | `UPPER`/`LOWER`/`TRIM`/`LENGTH`/`REPLACE` | `SELECT UPPER(country), LENGTH(disease_name) FROM ghs;` |
| Dates | `NOW`/`EXTRACT`/`DATE_TRUNC`/`MAKE_DATE` | `DATE_TRUNC('year', MAKE_DATE(year_event,1,1));` |
| DML | `INSERT`/`UPDATE`/`DELETE`/`MERGE` | `UPDATE ghs SET incidence = 0 WHERE incidence IS NULL;` |
| DDL | `CREATE`/`ALTER`/`DROP TABLE` | `ALTER TABLE ghs ADD COLUMN notes TEXT;` |
| Constraints | `PRIMARY KEY`/`UNIQUE`/`CHECK`/`NOT NULL` | `ALTER TABLE ghs ADD CHECK (mortality_pct BETWEEN 0 AND 100);` |
| Indexes | `CREATE INDEX` | `CREATE INDEX ON ghs(country, year_event);` |
| Transactions | `BEGIN`/`COMMIT`/`ROLLBACK`/`SAVEPOINT` | `BEGIN; UPDATE ghs SET dalys = 0; ROLLBACK;` |
| Security | `GRANT`/`REVOKE` | `GRANT SELECT ON ghs TO analyst_role;` |
| Load/Export | `COPY` / `\copy` | `COPY ghs FROM 'C:/temp/file.csv' WITH (CSV, HEADER);` |
| Introspection | `information_schema`, psql `\d` | `SELECT column_name FROM information_schema.columns WHERE table_name = 'global_health_statistics';` |
| Comments | `--`, `/* ... */` | `-- count rows` |
Notes to Keep Handy
- Prefer explicit joins with clear keys; avoid ambiguous `SELECT *` in production code.
- List columns in `COPY` for stable mapping if CSV order changes.
- Use forward slashes on Windows paths with `COPY` (`C:/...`).
- Create composite indexes to match common filters (e.g., `(country, year_event)`).
- Always sanity-check results with quick counts/limits before heavy analysis.
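The sanity-check habit can be scripted in a few lines. A minimal sketch (made-up rows; SQLite via Python for portability, but the same queries run in PostgreSQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (country TEXT, incidence REAL)")
con.executemany("INSERT INTO ghs VALUES (?, ?)",
                [("Australia", 2.0), ("Australia", None), ("France", 5.0)])

# Quick sanity checks before heavier analysis:
total    = con.execute("SELECT COUNT(*) FROM ghs").fetchone()[0]          # rows
distinct = con.execute("SELECT COUNT(DISTINCT country) FROM ghs").fetchone()[0]
nulls    = con.execute(
    "SELECT COUNT(*) FROM ghs WHERE incidence IS NULL").fetchone()[0]     # gaps
preview  = con.execute("SELECT * FROM ghs LIMIT 2").fetchall()            # peek
print(total, distinct, nulls)  # 3 2 1
```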
Part II — Connecting with R, R Markdown, Quarto
Why connect SQL with R (and publish via .Rmd/.qmd)?
Modern analytics benefits from using SQL + R together:
- SQL pushes heavy filtering/aggregation to the database (fast, scalable, auditable).
- R adds wrangling, visualization, statistics, and reproducible documents.
Why use R Markdown (.Rmd) or Quarto (.qmd):
- Reproducibility: prose + code + outputs in one file; anyone can re-run end to end.
- Transparency: every figure or table is tied to the exact SQL/R code that produced it.
- Shareability: render to HTML/PDF/Word for the web or submission.
- Multi-language: mix R chunks and SQL chunks; SQL runs through a live DBI connection created in R.
- Automation: parameterized reports, scheduled renders, version control.
How SQL chunks work in Quarto/R Markdown
1) Create a DBI connection object in an R chunk (e.g., con <- dbConnect(...)).
2) Use SQL chunks with connection = con to run SQL on that database.
3) Outputs (tables) are printed inline under each SQL chunk.
🔐 Security tip: Avoid committing passwords to public repos.
Execution order & prerequisites:
- Install required R packages (DBI, RSQLite, RPostgres, readr, tidyverse, dbplyr).
- Ensure your database server (PostgreSQL) is running when connecting to it.
- Verify file paths (e.g., C:/temp/Global_Health_Statistics.csv) exist on your machine.
- Render top-to-bottom so the connection objects exist before SQL chunks that use them.
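The execution-order point (the connection object must exist before any chunk that queries it) has the same shape in any client library. A minimal connect, query, disconnect sketch, using Python's `sqlite3` purely for illustration:

```python
import sqlite3

# Step 1: create the connection object first (like con <- dbConnect(...) in R).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INT)")
con.execute("INSERT INTO t VALUES (1), (2)")

# Step 2: later "chunks" reuse the same connection to run SQL.
result = con.execute("SELECT SUM(x) FROM t").fetchone()[0]
print(result)  # 3

# Step 3: close the connection when done (like dbDisconnect(con)).
con.close()
```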
R Markdown & Quarto Options (R + SQL)
This section summarizes chunk options and syntax when building reports that mix R chunks and SQL chunks (e.g., `{sql, connection = con}`). The goal is to control execution, display, and reproducibility while running queries directly against a database and printing results inline.
How SQL chunks work (at a glance)
- Open a DBI connection in an R chunk (e.g., con <- dbConnect(...)).
- Run queries in SQL chunks using connection = con.
- Options like echo, eval, include, warning, message, error, and cache behave the same as in R chunks.
- SQL chunk outputs render as tables; you can also capture them into R objects with output.var.
Summary: Quarto/R Markdown Chunk Options (R & SQL)
| Feature | Quarto / knitr Syntax | Description (applies to R & SQL chunks) |
|---|---|---|
| Code chunk (R) | `{r, warning=FALSE, message=FALSE}` | Execute R code; suppress messages/warnings for clean output. |
| Code chunk (SQL) | `{sql, connection = con}` | Execute SQL against a DBI connection object created in an R chunk. |
| Show/hide code | `echo: true/false` | Display the source code (true) or hide it (false). |
| Run/skip code | `eval: true/false` | Execute the chunk or show code without running it. |
| Hide code & output | `include: false` | Run side effects (e.g., temp tables) but don’t print code or results. |
| Control results | `results: 'markup'\|'asis'\|'hide'` | Choose how printed results appear. |
| Messages & warnings | `message: false`, `warning: false` | Suppress messages/warnings in output. |
| Error handling | `error: true/false` | If true, continue rendering and show the error text. |
| Caching | `cache: true` | Reuse results when inputs are unchanged (be careful if DB data changes). |
| Chunk labels | `{r my-label}`, `{sql my-label, connection = con}` | Name chunks for organization, cross-refs, and cache keys. |
| Figures (R) | `fig.width`, `fig.height`, `fig.align` | Size/alignment for plots created in R chunks. |
| Table printing | YAML `df-print: paged` | Global HTML table style; affects SQL outputs too. |
| Output formats | YAML `format:` | Select html, pdf, docx, etc., plus format-specific options. |
SQL–specific conveniences
| SQL Feature | Syntax | What it does |
|---|---|---|
| Bind a connection | `{sql, connection = con}` | Runs the SQL chunk using the DBI connection defined in an R chunk. |
| Capture result | `output.var = 'df'` | Saves the query output into an R object (e.g., df). |
| Show query only | `eval: false` | Display the SQL but do not execute (useful for teaching/drafts). |
| Show result only | `echo: false` | Execute the query and print just the table (hide the SQL text). |
| Hidden setup | `include: false` | Execute without showing anything (e.g., temp objects). |
| Resilient to errors | `error: true` | Don’t halt the build on a chunk error (prints the error). |
Examples
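As a minimal sketch of the mechanics above (the connection name `con`, database name, and table come from this project; the chunk labels are invented), a `.qmd` file might contain:

````markdown
```{r setup, include=FALSE}
library(DBI)
library(RPostgres)
con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "SQL_Data_Manipulation",
  host = "localhost", port = 5432,
  user = "postgres", password = Sys.getenv("PGPASSWORD")
)
```

```{sql preview, connection = con, output.var = "top_rows"}
SELECT country, year_event, dalys
FROM public.global_health_statistics
LIMIT 5;
```
````

Rendering top-to-bottom creates `con` first, prints the query result inline, and also leaves it in the R object `top_rows` for later chunks.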
Abbreviations & acronyms used in this section
- SQL — Structured Query Language
- RDBMS — Relational Database Management System
- ACID — Atomicity, Consistency, Isolation, Durability
- DBaaS — Database as a Service (e.g., RDS/Azure/GCP)
- GUI — Graphical User Interface (e.g., pgAdmin)
- CLI — Command-Line Interface (e.g., psql)
- CTE — Common Table Expression (WITH, WITH RECURSIVE)
- DDL — Data Definition Language (CREATE/ALTER/DROP)
- DML — Data Manipulation Language (SELECT/INSERT/UPDATE/DELETE/MERGE)
- DCL — Data Control Language (GRANT/REVOKE)
- PK — Primary Key
- FK — Foreign Key
- MV — Materialized View
- JSON/JSONB — JavaScript Object Notation (text/binary)
- GIN — Generalized Inverted Index (commonly for jsonb)
- WAL — Write-Ahead Log (durability in Postgres)
- QMD/RMD — Quarto / R Markdown documents
Part III — Data Loading & Exploration
- Table creation (DDL) & data types
- CSV import: `COPY`/`\copy` and permissions
- Basic exploration: previews, counts, distincts, NULL checks
Create the Project Database (with Australian locale)
We’ll call the project SQL_Data_Manipulation.
Note: You created the database with quoted/mixed case. In PostgreSQL, database names are case-sensitive if created with quotes. Always use the exact casing SQL_Data_Manipulation in tools/connectors.
Run this from the postgres database (or any admin DB) in pgAdmin Query Tool:
-- Create database with English_Australia.1252 locale (Windows)
-- If it already exists, skip this block.
CREATE DATABASE "SQL_Data_Manipulation"
WITH
OWNER = postgres
ENCODING = 'UTF8'
LC_COLLATE = 'English_Australia.1252'
LC_CTYPE = 'English_Australia.1252'
LOCALE_PROVIDER = 'libc'
TABLESPACE = pg_default
CONNECTION LIMIT = -1
IS_TEMPLATE = FALSE;
COMMENT ON DATABASE "SQL_Data_Manipulation"
IS 'This is the DB for SQL data manipulation project.';

After creation: in pgAdmin, right-click your server → Refresh, then select the database SQL_Data_Manipulation and open the Query Tool from that database node so subsequent SQL runs in the right place.
💡 Locale note (Windows): English_Australia.1252 controls collation/ctype (sorting/character classification). If creation fails, verify that locale is installed on the OS.
- Expand your server → right-click Databases → Create > Database…
- Database: SQL_Data_Manipulation_Guide → Save.
(Optional) Set this database as the Query Tool target using the dropdown in the toolbar.
🧱 Create the Target Table (in SQL_Data_Manipulation)
-- Ensure you're connected to the "SQL_Data_Manipulation" database before running this.
CREATE TABLE IF NOT EXISTS public.global_health_statistics (
country TEXT,
year_event INT,
disease_name TEXT,
disease_category TEXT,
prevalence NUMERIC,
incidence NUMERIC,
mortality_pct NUMERIC,
age_group TEXT,
gender TEXT,
population_affected BIGINT,
healthcare_access TEXT,
dr_per_1000 NUMERIC,
hospital_beds_per_1000 NUMERIC,
treatment_type TEXT,
average_treatment_cost_usd NUMERIC,
availability_of_vaccines_treatment TEXT,
recovery_rate_pct NUMERIC,
dalys NUMERIC,
improvement_in_5_years_pct NUMERIC,
per_capita_income_usd NUMERIC,
education_index NUMERIC,
urbanization_rate_pct NUMERIC
);

📥 Importing Data — Options and Tips
Importing Data Files into SQL
Before analyzing data in SQL, we need to bring it into a database table.
Unlike R or Python, SQL engines do not read files directly; instead, they rely on database commands (COPY, \copy, BULK INSERT, or external tools) to import files into a table.
Below is a summary of how different file types can be imported into common SQL engines.
| Data Type | SQL Import Method | Database | Example |
|---|---|---|---|
| CSV | `COPY` or `\copy` (Postgres), `BULK INSERT` (SQL Server), `.import` (SQLite) | PostgreSQL, SQLite, SQL Server | `COPY mytable FROM 'file.csv' CSV HEADER;` |
| TSV / Text (.tsv) | Same as CSV but with tab delimiter specified | PostgreSQL, SQLite | `COPY mytable FROM 'file.tsv' DELIMITER E'\t' CSV HEADER;` |
| Excel (.xls/.xlsx) | Convert to CSV first, then import (native SQL cannot parse Excel directly) | All | Save as CSV → `COPY` into SQL |
| ZIP archives | Extract the file first, then import the CSV into SQL | All | `unzip data.zip` → `COPY mytable FROM 'file.csv' CSV HEADER;` |
| Stata/SPSS (.dta/.sav) | Convert to CSV using R/Python/StatTransfer, then load into SQL | All | R: `write.csv(read_dta("file.dta"), "file.csv")` → SQL `COPY` |
| Online files | Download first (with `wget`/`curl`), then import the CSV | All | `wget URL -O file.csv` → SQL `COPY` |
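Whatever the engine, the pattern is the same: create the table, parse the file, bulk-insert the rows. A self-contained sketch with Python's `csv` and `sqlite3` modules (the CSV content is inlined here for illustration; a real import would read from a file path):

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file on disk (COPY/\copy would read a real path instead).
csv_text = "country,year_event,incidence\nAustralia,2021,2.0\nFrance,2021,5.0\n"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (country TEXT, year_event INT, incidence REAL)")

# Skip the header row (as HEADER true does), then bulk-insert the data rows.
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
con.executemany("INSERT INTO ghs VALUES (?, ?, ?)", reader)
con.commit()

n = con.execute("SELECT COUNT(*) FROM ghs").fetchone()[0]
print(n)  # 2
```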
Two ways to load CSV data:
Option A — pgAdmin GUI (Import/Export)
1. Right-click the table → View/Edit Data → All Rows
2. Click Import/Export (📥)
3. Import tab: choose the file, set format CSV, check Header, set delimiter `,`
4. Click OK
Option B — SQL Import
There are two SQL variants:
B1. Server-side COPY (runs on the DB server process)
Works in pgAdmin Query Tool.
The server must be able to read the file path. On Windows, the service account often cannot read Desktop/Documents.
- Fix: move the file to a permissive folder like `C:/temp/`.
COPY public.global_health_statistics (
country, year_event, disease_name, disease_category, prevalence, incidence, mortality_pct,
age_group, gender, population_affected, healthcare_access, dr_per_1000,
hospital_beds_per_1000, treatment_type, average_treatment_cost_usd,
availability_of_vaccines_treatment, recovery_rate_pct, dalys,
improvement_in_5_years_pct, per_capita_income_usd, education_index, urbanization_rate_pct
)
FROM 'C:/temp/Global_Health_Statistics.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"', ESCAPE '"', ENCODING 'UTF8');

B2. Client-side \copy (psql)
Works only in psql (i.e., the SQL Shell), not pgAdmin.
Avoids server-permission issues (can read Desktop paths).
\copy public.global_health_statistics (
country, year_event, disease_name, disease_category, prevalence, incidence, mortality_pct,
age_group, gender, population_affected, healthcare_access, dr_per_1000,
hospital_beds_per_1000, treatment_type, average_treatment_cost_usd,
availability_of_vaccines_treatment, recovery_rate_pct, dalys,
improvement_in_5_years_pct, per_capita_income_usd, education_index, urbanization_rate_pct
)
FROM 'C:/Users/User/Desktop/Materials_ Course and proposals/Course Related/Data_source/Global_Health_Statistics.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"', ESCAPE '"', ENCODING 'UTF8');

Tips (Windows paths)
Prefer forward slashes (C:/…) to avoid escaping backslashes.
If “permission denied” appears with COPY, move the file to C:/temp/ or use client-side \copy instead.
Method used in this project
Used: Option B1 — server-side COPY from C:/temp/Global_Health_Statistics.csv.
Result: COPY 1000000 → 1,000,000 rows imported successfully.
Where to See the Imported Data (Names & Columns)
In pgAdmin: Servers → Databases → SQL_Data_Manipulation → Schemas → public → Tables → global_health_statistics
Right-click → View/Edit Data → All Rows to preview.
Select the table → Columns tab (right pane) to see column names and types.
1) Importing data to R from a local path
This chunk reads the CSV into R, opens a SQLite database file named SQL_Data_Manipulation, and writes the data as a SQLite table. It’s a lightweight, local option for quick iteration (no server required).
# Clear environment and console
rm(list = ls()) # To clear the environment (remove all objects in memory)
cat("\014")   # To clear the console (like wiping a whiteboard)

library(DBI)
library(RSQLite)
library(readr)
library(tidyverse)
library(dbplyr)
# Read CSV
global_health_statistics <- read_csv("C:/temp/Global_Health_Statistics.csv")
# Connect to SQLite
con <- dbConnect(RSQLite::SQLite(), dbname = "SQL_Data_Manipulation")
# Write table to DB
dbWriteTable(con, "global_health_statistics", global_health_statistics, overwrite = TRUE)
# knitr::opts_chunk$set(connection = "con")

Use SQL code from here onwards
The next SQL chunk runs against the current con connection (SQLite here) and previews rows.
SELECT * FROM global_health_statistics LIMIT 10;

| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Italy | 2013 | Malaria | Respiratory | 0.95 | 1.55 | 8.42 | 0-18 | Male | 471007 | 57.74 | 3.34 | 7.58 | Medication | 21064 | No | 91.82 | 4493 | 2.16 | 16886 | 0.79 | 86.02 |
| France | 2002 | Ebola | Parasitic | 12.46 | 8.63 | 8.75 | 61+ | Male | 634318 | 89.21 | 1.33 | 5.11 | Surgery | 47851 | Yes | 76.65 | 2366 | 4.82 | 80639 | 0.74 | 45.52 |
| Turkey | 2015 | COVID-19 | Genetic | 0.91 | 2.35 | 6.22 | 36-60 | Male | 154878 | 56.41 | 4.07 | 3.49 | Vaccination | 27834 | Yes | 98.55 | 41 | 5.81 | 12245 | 0.41 | 40.20 |
| Indonesia | 2011 | Parkinson’s Disease | Autoimmune | 4.68 | 6.29 | 3.99 | 0-18 | Other | 446224 | 85.20 | 3.18 | 8.44 | Surgery | 144 | Yes | 67.35 | 3201 | 2.22 | 49336 | 0.49 | 58.47 |
| Italy | 2013 | Tuberculosis | Genetic | 0.83 | 13.59 | 7.01 | 61+ | Male | 472908 | 67.00 | 4.61 | 5.90 | Medication | 8908 | Yes | 50.06 | 2832 | 6.93 | 47701 | 0.50 | 48.14 |
| Saudi Arabia | 2011 | Dengue | Bacterial | 10.99 | 6.49 | 4.64 | 61+ | Female | 479234 | 98.41 | 3.50 | 0.62 | Therapy | 42671 | Yes | 93.17 | 416 | 9.83 | 29597 | 0.46 | 56.50 |
| USA | 2013 | Malaria | Cardiovascular | 18.42 | 6.33 | 9.33 | 61+ | Female | 289931 | 90.03 | 3.16 | 3.31 | Surgery | 15579 | No | 92.80 | 4535 | 0.89 | 60027 | 0.70 | 20.48 |
| Nigeria | 2007 | Tuberculosis | Neurological | 3.48 | 5.71 | 1.21 | 0-18 | Female | 393296 | 75.60 | 0.54 | 3.54 | Medication | 15744 | Yes | 65.45 | 4584 | 9.81 | 23222 | 0.46 | 66.49 |
| Italy | 2000 | Rabies | Chronic | 15.59 | 4.74 | 6.38 | 19-35 | Female | 253110 | 87.87 | 4.56 | 5.84 | Therapy | 7669 | Yes | 59.23 | 2253 | 9.92 | 30849 | 0.55 | 41.27 |
| Australia | 2006 | Cholera | Chronic | 10.12 | 2.08 | 6.00 | 61+ | Male | 174143 | 95.90 | 4.63 | 6.01 | Medication | 9468 | Yes | 93.21 | 4694 | 2.96 | 68856 | 0.90 | 83.30 |
SELECT
COUNT(*) FILTER (WHERE country IS NULL) AS null_country,
COUNT(*) FILTER (WHERE disease_name IS NULL) AS null_disease
FROM global_health_statistics;

| null_country | null_disease |
|---|---|
| 0 | 0 |
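The NULL-profiling query uses PostgreSQL's `COUNT(*) FILTER (WHERE ...)`; the portable `SUM(CASE ...)` form does the same job in engines without FILTER. A small sketch with invented rows (SQLite via Python):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (country TEXT, disease_name TEXT)")
con.executemany("INSERT INTO ghs VALUES (?, ?)",
                [("Italy", "Malaria"), (None, "Ebola"), ("France", None)])

# Profile NULLs in several columns in a single pass over the table.
row = con.execute("""
    SELECT SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END) AS null_country,
           SUM(CASE WHEN disease_name IS NULL THEN 1 ELSE 0 END) AS null_disease
    FROM ghs
""").fetchone()
print(row)  # (1, 1)
```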
2) Reading the data available in PostgreSQL via R
This section opens a PostgreSQL connection using RPostgres to the database “SQL_Data_Manipulation”. Best practice is to keep credentials out of code (use env vars or keychain).
# install.packages(c("DBI","RPostgres","dplyr")) # run once
library(DBI)
library(RPostgres)
library(dplyr)
# (Optional) supply credentials via environment variables; in real use, set
# these outside the script (e.g., in .Renviron) rather than hard-coding them.
Sys.setenv(PGUSER = "postgres", PGPASSWORD = "yourpassword")
con <- dbConnect(
RPostgres::Postgres(),
host = "localhost",
port = 5432,
dbname = "SQL_Data_Manipulation", # matches your screenshot
user = Sys.getenv("PGUSER", "postgres"),
password = Sys.getenv("PGPASSWORD", "yourpassword")
)

Pull the table from PostgreSQL into an R data frame (or use dbGetQuery to run a SQL SELECT).
global_health_statistics <- dbReadTable(con, Id(schema = "public", table = "global_health_statistics"))
# or: ghs <- dbGetQuery(con, "SELECT * FROM public.global_health_statistics;")

Reading and Importing Data
In SQL, importing means loading external files into a database table so they can be queried.
Unlike R, where we use functions like read_csv() or read_excel(), in SQL we rely on commands such as COPY, LOAD DATA INFILE, or BULK INSERT depending on the database system.
📊 Reading Common Data Types in SQL
| Data Type | SQL Command / Method | Notes | Example |
|---|---|---|---|
| CSV | COPY (Postgres), LOAD DATA INFILE (MySQL), BULK INSERT (SQL Server) | Most common tabular format | COPY patients FROM '/path/patients.csv' CSV HEADER; |
| Excel (.xls/.xlsx) | Not directly supported | Convert to CSV or import via ETL tools | |
| Tab-separated (.tsv) | Same as CSV with delimiter \t | | COPY patients FROM '/path/file.tsv' DELIMITER E'\t' CSV HEADER; |
| Text (.txt) | Load as delimited file | Need to define delimiter/schema | |
| Stata (.dta), SPSS (.sav), RDS | Not supported natively | Convert to CSV first (R/Python/SAS) | |
| ZIP archives | SQL cannot unzip | Unzip externally, then import CSV | |
| Online CSV/Excel | Some cloud DBs (BigQuery, Snowflake) allow URL loading | Otherwise: download first | LOAD DATA ... FROM 'https://...' |
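For PostgreSQL specifically, the COPY/\copy pair from the table above is worth seeing in full. A sketch (the file paths are hypothetical; the target table must already exist with matching columns):

```sql
-- Server-side COPY: the file path is read by the PostgreSQL server process,
-- so it requires superuser rights or pg_read_server_files membership.
COPY public.global_health_statistics
FROM '/srv/data/global_health_statistics.csv'
WITH (FORMAT csv, HEADER true);

-- Client-side \copy (a psql meta-command): the file is read by the client,
-- so ordinary INSERT privileges on the table are enough.
-- \copy public.global_health_statistics FROM 'C:/data/global_health_statistics.csv' CSV HEADER
```

The distinction matters on Windows installs, where the server process typically cannot read files from a user's Documents folder.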
From here onward, the examples run SQL queries directly against PostgreSQL.
This demonstration shows how I use SQL to explore and validate a large dataset on global health statistics. Each section introduces the query with a short explanation, followed by the SQL code and notes about the expected output.
Data Exploration
Before analysis, it is important to preview the dataset to confirm structure and column names.
The following query selects the first 10 rows, showing every variable: country, year_event, disease_name, prevalence, incidence, mortality_pct, and the socio-economic indicators.
-- check some Sample rows
SELECT *
FROM public.global_health_statistics
LIMIT 10;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| China | 2015 | Hepatitis | Genetic | 19.87 | 3.07 | 8.54 | 0-18 | Male | 979238 | 71.63 | 1.71 | 1.16 | Therapy | 33951 | Yes | 51.60 | 980 | 1.62 | 74848 | 0.84 | 51.33 |
| Australia | 2003 | Asthma | Bacterial | 16.47 | 5.11 | 7.73 | 0-18 | Female | 983250 | 58.48 | 4.22 | 1.08 | Therapy | 18945 | Yes | 74.70 | 1794 | 7.15 | 59994 | 0.54 | 44.09 |
| Germany | 2003 | Dengue | Neurological | 0.50 | 3.89 | 9.19 | 19-35 | Male | 995467 | 53.86 | 2.98 | 8.98 | Therapy | 16993 | No | 69.58 | 2569 | 0.05 | 16782 | 0.40 | 52.60 |
| China | 2002 | Malaria | Autoimmune | 18.38 | 3.81 | 6.77 | 19-35 | Female | 990949 | 62.36 | 1.66 | 9.68 | Therapy | 17764 | Yes | 97.23 | 3047 | 5.84 | 89944 | 0.81 | 21.99 |
| Russia | 2007 | COVID-19 | Bacterial | 15.03 | 12.88 | 3.54 | 61+ | Other | 996438 | 88.06 | 4.02 | 2.22 | Vaccination | 29279 | No | 56.19 | 4102 | 9.71 | 43114 | 0.49 | 57.31 |
| Russia | 2012 | Polio | Bacterial | 16.17 | 9.98 | 3.28 | 19-35 | Other | 998885 | 80.62 | 3.92 | 3.00 | Vaccination | 34751 | Yes | 89.90 | 1876 | 8.80 | 40763 | 0.45 | 54.51 |
| South Africa | 2000 | Hypertension | Neurological | 3.53 | 5.05 | 5.89 | 19-35 | Male | 977922 | 62.92 | 3.59 | 9.15 | Therapy | 44512 | Yes | 92.83 | 1045 | 5.52 | 57064 | 0.50 | 50.14 |
| South Korea | 2009 | Cholera | Bacterial | 17.03 | 14.88 | 5.54 | 36-60 | Male | 993314 | 96.23 | 2.55 | 6.10 | Therapy | 25692 | Yes | 73.43 | 180 | 7.13 | 60687 | 0.53 | 30.15 |
| Turkey | 2017 | Leprosy | Bacterial | 18.75 | 8.30 | 1.51 | 61+ | Other | 972077 | 75.82 | 4.16 | 8.22 | Medication | 2330 | Yes | 58.85 | 1940 | 8.76 | 96878 | 0.68 | 62.72 |
| Indonesia | 2019 | Dengue | Respiratory | 15.21 | 3.26 | 9.92 | 61+ | Male | 991463 | 69.99 | 1.58 | 1.45 | Therapy | 23547 | Yes | 50.35 | 994 | 2.74 | 71648 | 0.60 | 32.33 |
Confirm how many rows were loaded into the table. Here the query returns 10,000 rows.
-- Total rows loaded
SELECT COUNT(*) AS rows_loaded
FROM public.global_health_statistics;
| rows_loaded |
|---|
| 10000 |
Check the earliest and latest years in the dataset: spans from 2000 to 2024.
SELECT MIN(year_event) AS min_year, MAX(year_event) AS max_year
FROM public.global_health_statistics;
| min_year | max_year |
|---|---|
| 2000 | 2024 |
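Distinct counts round out the basic exploration checklist (previews, counts, distincts, NULL checks). A sketch against the same table; the counts depend on the loaded data:

```sql
-- How many distinct countries, diseases, and years appear?
SELECT
  COUNT(DISTINCT country)      AS n_countries,
  COUNT(DISTINCT disease_name) AS n_diseases,
  COUNT(DISTINCT year_event)   AS n_years
FROM public.global_health_statistics;
```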
Look at the most recent events by sorting years in descending order:
SELECT country, year_event, incidence
FROM public.global_health_statistics
ORDER BY year_event DESC
LIMIT 10;
| country | year_event | incidence |
|---|---|---|
| Indonesia | 2024 | 9.01 |
| Germany | 2024 | 11.98 |
| France | 2024 | 12.53 |
| Canada | 2024 | 10.95 |
| Nigeria | 2024 | 5.49 |
| Brazil | 2024 | 7.82 |
| Russia | 2024 | 4.51 |
| Japan | 2024 | 3.38 |
| UK | 2024 | 10.58 |
| Saudi Arabia | 2024 | 4.10 |
Checking Data Completeness
Check for missing values at the PostgreSQL level: Both null_country and null_disease return 0.
SELECT
COUNT(*) FILTER (WHERE country IS NULL) AS null_country,
COUNT(*) FILTER (WHERE disease_name IS NULL) AS null_disease
FROM public.global_health_statistics;
| null_country | null_disease |
|---|---|
| 0 | 0 |
Creating a Basic Data Dictionary
-- PostgreSQL example: auto-generate a simple data dictionary
SELECT
table_name,
column_name,
data_type,
is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
AND table_name = 'patients';
| table_name | column_name | data_type | is_nullable |
|---|---|---|---|
(The result is empty here because no patients table exists in this demo database.)
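Pointed at the demo table instead of the patients placeholder, the same metadata query documents every column; ordering by ordinal_position keeps the dictionary in table order:

```sql
-- Data dictionary for the dataset used in this presentation
SELECT
  table_name,
  column_name,
  data_type,
  is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'global_health_statistics'
ORDER BY ordinal_position;
```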
📊 Summary Table: SQL Equivalents
The table below provides SQL equivalents for common R exploration tasks.
It shows the task, the general SQL approach, and an example query.
| Task | SQL Approach | Example |
|---|---|---|
| View structure of dataset | Query metadata | SELECT column_name, data_type FROM information_schema.columns |
| Summarize statistics | Aggregate functions | MIN(), MAX(), AVG(), STDDEV() |
| Preview first rows | LIMIT | SELECT * FROM patients LIMIT 10; |
| Preview last rows | ORDER BY ... DESC LIMIT n | SELECT * FROM patients ORDER BY id DESC LIMIT 8; |
| Count frequencies | GROUP BY + COUNT() | SELECT gender, COUNT(*) FROM patients GROUP BY gender; |
| Count rows | COUNT(*) | SELECT COUNT(*) FROM patients; |
| Count columns | Metadata query | SELECT COUNT(*) FROM information_schema.columns; |
| List variable names | Metadata query | SELECT column_name FROM information_schema.columns; |
| Check variable type | Metadata query | SELECT column_name, data_type FROM information_schema.columns; |
| Create data dictionary | Metadata query | (see metadata query example above) |
| Clean column/text values | SQL functions | UPPER(TRIM(col)) |
Part IV — Wrangling Core (SELECT / WHERE / CASE / JOIN)
Once a dataset has been imported and explored, the next step is wrangling — reshaping and preparing the data for analysis.
In SQL, the core tools for wrangling are:
- Selecting & aliasing (choosing columns, renaming for clarity, creating computed columns)
- Filtering, sorting, limiting (subsetting rows)
- Conditional logic (CASE, COALESCE, NULLIF)
- Joins (combining data across multiple tables)
📊 Summary Table — Conditional Logic & Joins
The table below summarizes key SQL wrangling patterns using the global_health_statistics dataset.
It highlights creating variables with CASE, handling missing values, and common join strategies, with representative examples.
| Concept | Description | Example SQL |
|---|---|---|
| Binary indicator with CASE | Recode into two groups (like ifelse() in R). | sql<br>SELECT disease_name, mortality_pct,<br> CASE WHEN mortality_pct > 20 THEN 'High Mortality'<br> ELSE 'Low/Moderate Mortality' END AS mortality_group<br>FROM public.global_health_statistics<br>LIMIT 10; |
| Multi-category recoding | Multiple conditions, like case_when() in R. | sql<br>SELECT disease_name, mortality_pct,<br> CASE WHEN mortality_pct < 5 THEN 'Low'<br> WHEN mortality_pct BETWEEN 5 AND 20 THEN 'Moderate'<br> WHEN mortality_pct > 20 THEN 'High'<br> ELSE 'Unknown' END AS mortality_category<br>FROM public.global_health_statistics<br>LIMIT 10; |
| Buckets based on numeric ranges | Group continuous values into buckets. | sql<br>SELECT country, disease_name, year_event, population_affected,<br> CASE WHEN population_affected IS NULL THEN 'unknown'<br> WHEN population_affected < 1_000 THEN 'very low'<br> WHEN population_affected < 10_000 THEN 'low'<br> WHEN population_affected < 100_000 THEN 'moderate'<br> WHEN population_affected < 1_000_000 THEN 'high'<br> ELSE 'very high' END AS burden_bucket<br>FROM public.global_health_statistics<br>ORDER BY year_event DESC, population_affected DESC<br>LIMIT 10; |
| Combine multiple variables | Conditional logic across >1 column. | sql<br>SELECT disease_name, age_group,<br> CASE WHEN disease_category = 'Infectious' AND age_group = '0-14' THEN 'Child Infectious'<br> WHEN disease_category = 'Infectious' AND age_group = '65+' THEN 'Elderly Infectious'<br> WHEN disease_category = 'Non-communicable' AND age_group = '15-64' THEN 'Adult NCD'<br> ELSE 'Other' END AS combined_category<br>FROM public.global_health_statistics<br>LIMIT 10; |
| Handle missing values | COALESCE replaces NULL, NULLIF avoids divide-by-zero. | sql<br>SELECT country, COALESCE(healthcare_access,'Unknown') AS healthcare_access<br>FROM public.global_health_statistics<br>LIMIT 10; |
| LEFT JOIN | Keep all fact rows, enrich where possible. | sql<br>SELECT g.country, COALESCE(cr.region,'Unknown') AS region,<br> g.disease_name, g.year_event, g.population_affected<br>FROM public.global_health_statistics g<br>LEFT JOIN country_region cr ON cr.country = g.country<br>WHERE g.year_event BETWEEN 2018 AND 2022<br>ORDER BY g.year_event DESC, region, g.population_affected DESC<br>LIMIT 15; |
| FULL JOIN | Keep all rows from both tables; useful for reconciliation. | sql<br>SELECT COALESCE(g.country, cr.country) AS country_key,<br> cr.region, g.disease_name, g.year_event, g.population_affected,<br> CASE WHEN g.country IS NULL THEN 'missing_in_facts'<br> WHEN cr.country IS NULL THEN 'missing_in_dim'<br> ELSE 'matched' END AS join_status<br>FROM public.global_health_statistics g<br>FULL OUTER JOIN country_region cr ON cr.country = g.country<br>LIMIT 10; |
| Anti-join | Find fact rows with no matching dimension. | sql<br>SELECT g.*<br>FROM public.global_health_statistics g<br>WHERE NOT EXISTS (SELECT 1 FROM country_region cr WHERE cr.country = g.country)<br>LIMIT 10; |
| Semi-join | Keep fact rows that have a match, without returning dimension columns. | sql<br>SELECT g.country, g.disease_name, g.year_event, g.population_affected<br>FROM public.global_health_statistics g<br>WHERE EXISTS (SELECT 1 FROM country_region cr WHERE cr.country = g.country)<br>LIMIT 10; |
For demonstration, I will continue using the global_health_statistics dataset.
1. Selecting & Aliasing
Aliasing allows us to rename columns or create new computed fields directly in queries.
Here, I renamed columns (country → country_name) and created a computed column to express affected population in millions. This makes query results easier to interpret when preparing reports.
-- Select key columns with aliases and a computed column
SELECT
country AS country_name,
disease_name AS disease,
year_event AS year,
population_affected,
ROUND(CAST(population_affected AS numeric) / 1000000, 2) AS population_millions
FROM public.global_health_statistics
LIMIT 10;
| country_name | disease | year | population_affected | population_millions |
|---|---|---|---|---|
| Saudi Arabia | Parkinson’s Disease | 2005 | 994863 | 0.99 |
| Australia | Measles | 2003 | 967319 | 0.97 |
| Italy | Ebola | 2003 | 983403 | 0.98 |
| India | Ebola | 2016 | 988479 | 0.99 |
| Turkey | Hepatitis | 2001 | 995574 | 1.00 |
| Australia | Asthma | 2007 | 997674 | 1.00 |
| Saudi Arabia | Zika | 2021 | 974805 | 0.97 |
| USA | Cholera | 2020 | 997379 | 1.00 |
| Saudi Arabia | HIV/AIDS | 2012 | 984721 | 0.98 |
| UK | COVID-19 | 2023 | 999050 | 1.00 |
2. Filtering, Sorting, Limiting
Filtering rows focuses on subsets of interest. Sorting orders results, and limiting restricts the number of records returned.
The code below extracts the countries most affected by Tuberculosis after 2015, ranked by population impacted. Filtering with WHERE, ordering with ORDER BY, and subsetting with LIMIT form the core of SQL wrangling.
-- Top 10 most affected records for Tuberculosis after 2015
SELECT
country,
year_event,
disease_name,
population_affected
FROM public.global_health_statistics
WHERE disease_name = 'Tuberculosis'
AND year_event > 2015
ORDER BY population_affected DESC
LIMIT 10;
| country | year_event | disease_name | population_affected |
|---|---|---|---|
| USA | 2021 | Tuberculosis | 999879 |
| South Korea | 2022 | Tuberculosis | 999833 |
| UK | 2019 | Tuberculosis | 999818 |
| South Africa | 2018 | Tuberculosis | 999784 |
| Indonesia | 2018 | Tuberculosis | 999769 |
| South Korea | 2019 | Tuberculosis | 999739 |
| South Korea | 2024 | Tuberculosis | 999711 |
| India | 2016 | Tuberculosis | 999603 |
| India | 2023 | Tuberculosis | 999487 |
| Italy | 2022 | Tuberculosis | 999477 |
While iterating on a query, prefer explicit predicates, deterministic ordering, and tight limits.
-- explicit predicates, deterministic ordering, and tight limits while iterating
SELECT country, disease_name, year_event, population_affected
FROM public.global_health_statistics
WHERE year_event BETWEEN 2010 AND 2020 -- time window
AND disease_name ILIKE '%influenza%' -- case-insensitive contains
AND population_affected IS NOT NULL -- avoid surprises in math
ORDER BY year_event DESC, population_affected DESC -- deterministic, business-first order
LIMIT 10;
| country | disease_name | year_event | population_affected |
|---|---|---|---|
| Brazil | Influenza | 2020 | 999885 |
| UK | Influenza | 2020 | 999365 |
| South Korea | Influenza | 2020 | 999300 |
| China | Influenza | 2020 | 998855 |
| Nigeria | Influenza | 2020 | 996164 |
| France | Influenza | 2020 | 995307 |
| Italy | Influenza | 2020 | 994765 |
| Indonesia | Influenza | 2020 | 994646 |
| Germany | Influenza | 2020 | 994263 |
| Australia | Influenza | 2020 | 992359 |
Filtering by specific letters, words, or names
Filtering with LIKE and ILIKE
Filtering text values is a common requirement in health datasets.
Often we need to retrieve records where a disease name, country, or other text field contains, starts with, or ends with a certain word.
In SQL, this is achieved using the LIKE operator; in PostgreSQL, ILIKE is the case-insensitive variant.
- % matches any sequence of characters (including none).
- _ matches a single character.
Below is a summary table of common filtering patterns:
| Pattern | Description | SQL Example |
|---|---|---|
| %Alzheimer% | Matches any value containing “Alzheimer” | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '%Alzheimer%'; |
| Alzheimer% | Matches values that start with “Alzheimer” | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE 'Alzheimer%'; |
| %Alzheimer | Matches values that end with “Alzheimer” | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '%Alzheimer'; |
| %flu% | Matches any value containing “flu” (e.g. “Influenza”, “Avian flu”) | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '%flu%'; |
| _alaria | Matches a 7-letter word ending in “alaria” (e.g. “Malaria”) | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '_alaria'; |
| C% | Matches any value starting with the letter “C” | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE 'C%'; |
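Because the wildcard rules are easy to misremember, here is a self-contained check that runs without the demo table, using an inline VALUES list (the three disease names are chosen for illustration):

```sql
-- %: any sequence of characters; _: exactly one character
SELECT d,
       d LIKE '%flu%'   AS contains_flu,    -- true only for 'Influenza' here (LIKE is case-sensitive)
       d LIKE '_alaria' AS one_char_prefix  -- true only for the 7-letter 'Malaria'
FROM (VALUES ('Influenza'), ('Malaria'), ('Filariasis')) AS t(d);
```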
-- Find all records where the disease name starts with “Influenza”
SELECT *
FROM public.global_health_statistics
WHERE disease_name LIKE 'Influenza%';
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| South Korea | 2021 | Influenza | Autoimmune | 16.40 | 9.03 | 8.06 | 0-18 | Female | 991918 | 96.11 | 2.54 | 7.47 | Medication | 24235 | Yes | 84.39 | 1895 | 5.32 | 83332 | 0.62 | 41.45 |
| Russia | 2024 | Influenza | Neurological | 9.23 | 11.75 | 7.85 | 36-60 | Male | 995982 | 93.26 | 4.22 | 6.31 | Vaccination | 8893 | No | 79.07 | 2119 | 2.34 | 93844 | 0.66 | 74.08 |
| Mexico | 2007 | Influenza | Autoimmune | 14.52 | 14.32 | 9.96 | 0-18 | Female | 985023 | 89.24 | 3.35 | 3.56 | Surgery | 17797 | Yes | 67.76 | 1859 | 3.29 | 45837 | 0.63 | 31.86 |
| Nigeria | 2016 | Influenza | Neurological | 4.46 | 0.88 | 4.07 | 61+ | Male | 971438 | 99.91 | 1.35 | 5.23 | Medication | 16069 | Yes | 54.47 | 4392 | 3.98 | 42556 | 0.83 | 70.52 |
| Russia | 2005 | Influenza | Chronic | 11.17 | 5.11 | 4.62 | 61+ | Other | 978667 | 63.33 | 3.12 | 0.79 | Vaccination | 39282 | Yes | 50.16 | 1032 | 0.71 | 19698 | 0.78 | 37.30 |
| Saudi Arabia | 2016 | Influenza | Respiratory | 15.43 | 0.67 | 3.53 | 36-60 | Female | 993985 | 61.94 | 4.18 | 3.19 | Vaccination | 20822 | No | 68.19 | 3037 | 6.43 | 40410 | 0.65 | 75.44 |
| Australia | 2005 | Influenza | Autoimmune | 14.17 | 6.03 | 3.00 | 36-60 | Other | 998326 | 67.87 | 4.30 | 8.71 | Medication | 41420 | Yes | 73.73 | 3639 | 9.46 | 31151 | 0.52 | 56.19 |
| Germany | 2008 | Influenza | Genetic | 19.75 | 0.87 | 1.31 | 19-35 | Other | 976737 | 52.15 | 1.74 | 0.63 | Medication | 27479 | No | 84.86 | 2169 | 9.38 | 21133 | 0.89 | 80.84 |
| France | 2014 | Influenza | Infectious | 5.38 | 13.02 | 4.03 | 61+ | Female | 966348 | 52.44 | 4.15 | 9.89 | Surgery | 18901 | Yes | 75.84 | 990 | 0.95 | 81729 | 0.65 | 39.77 |
| Brazil | 2013 | Influenza | Metabolic | 13.00 | 3.53 | 8.28 | 36-60 | Female | 963595 | 61.27 | 2.40 | 4.45 | Vaccination | 36551 | No | 82.40 | 2619 | 1.95 | 47881 | 0.63 | 74.25 |
Use ILIKE instead of LIKE to make the search case-insensitive:
-- Case-insensitive match for "flu"
SELECT *
FROM public.global_health_statistics
WHERE disease_name ILIKE '%flu%';
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| France | 2000 | Influenza | Neurological | 10.98 | 12.50 | 6.16 | 19-35 | Male | 972289 | 90.77 | 0.81 | 8.97 | Therapy | 20977 | Yes | 68.52 | 240 | 7.06 | 50477 | 0.52 | 54.33 |
| Saudi Arabia | 2012 | Influenza | Respiratory | 1.42 | 3.61 | 6.36 | 61+ | Other | 996854 | 64.29 | 2.52 | 7.32 | Therapy | 27310 | Yes | 88.74 | 4757 | 0.45 | 18559 | 0.87 | 20.21 |
| UK | 2004 | Influenza | Cardiovascular | 5.65 | 7.87 | 0.45 | 61+ | Male | 982164 | 50.52 | 4.30 | 6.90 | Vaccination | 357 | Yes | 50.72 | 2576 | 8.51 | 96145 | 0.72 | 45.57 |
| South Africa | 2003 | Influenza | Genetic | 18.19 | 0.77 | 1.28 | 19-35 | Male | 984296 | 78.75 | 4.08 | 3.62 | Vaccination | 36018 | Yes | 94.75 | 1958 | 4.49 | 24740 | 0.69 | 45.43 |
| Saudi Arabia | 2011 | Influenza | Neurological | 18.44 | 7.17 | 8.03 | 36-60 | Female | 991688 | 75.60 | 4.46 | 2.18 | Therapy | 5946 | No | 63.11 | 2149 | 7.49 | 60796 | 0.51 | 49.90 |
| Turkey | 2019 | Influenza | Bacterial | 17.33 | 1.57 | 8.76 | 0-18 | Female | 990389 | 69.89 | 4.80 | 9.46 | Therapy | 5303 | Yes | 62.03 | 709 | 1.20 | 54889 | 0.49 | 29.12 |
| Italy | 2014 | Influenza | Genetic | 0.58 | 11.24 | 2.86 | 61+ | Male | 971317 | 91.63 | 2.32 | 8.15 | Vaccination | 48912 | Yes | 85.41 | 3653 | 8.92 | 53230 | 0.49 | 40.62 |
| Germany | 2024 | Influenza | Chronic | 15.96 | 2.96 | 4.53 | 0-18 | Female | 994346 | 64.58 | 4.77 | 7.13 | Vaccination | 43370 | No | 54.10 | 4054 | 6.88 | 72122 | 0.47 | 26.04 |
| Australia | 2016 | Influenza | Parasitic | 1.23 | 0.33 | 7.45 | 36-60 | Other | 996915 | 55.93 | 2.55 | 2.72 | Therapy | 42703 | No | 60.02 | 336 | 4.81 | 28099 | 0.66 | 85.06 |
| China | 2008 | Influenza | Parasitic | 12.37 | 2.22 | 4.60 | 19-35 | Male | 992246 | 58.01 | 4.08 | 4.98 | Vaccination | 4833 | No | 98.34 | 117 | 2.56 | 53237 | 0.59 | 78.22 |
-- Find all records where the disease name contains "Alzheimer"
SELECT *
FROM public.global_health_statistics
WHERE disease_name LIKE '%Alzheimer%';
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| India | 2022 | Alzheimer’s Disease | Neurological | 1.39 | 6.56 | 9.98 | 19-35 | Female | 993771 | 89.40 | 3.56 | 4.85 | Therapy | 34353 | No | 65.82 | 3919 | 3.99 | 46885 | 0.63 | 33.94 |
| Nigeria | 2002 | Alzheimer’s Disease | Viral | 15.92 | 12.36 | 6.57 | 19-35 | Male | 982502 | 76.19 | 2.47 | 6.80 | Surgery | 25648 | Yes | 98.90 | 1757 | 8.64 | 89792 | 0.79 | 46.24 |
| Turkey | 2017 | Alzheimer’s Disease | Respiratory | 15.07 | 14.96 | 8.67 | 0-18 | Female | 993217 | 79.87 | 2.17 | 2.43 | Therapy | 23187 | No | 71.03 | 2156 | 8.09 | 28630 | 0.56 | 83.17 |
| Germany | 2023 | Alzheimer’s Disease | Viral | 17.64 | 3.77 | 5.58 | 19-35 | Other | 982340 | 81.97 | 2.35 | 8.19 | Therapy | 45121 | No | 65.34 | 1861 | 3.76 | 74511 | 0.65 | 62.16 |
| Japan | 2020 | Alzheimer’s Disease | Autoimmune | 7.30 | 6.18 | 4.46 | 36-60 | Female | 994553 | 53.65 | 2.31 | 4.98 | Surgery | 36987 | No | 79.00 | 2286 | 2.89 | 80599 | 0.71 | 34.09 |
| Japan | 2015 | Alzheimer’s Disease | Viral | 14.21 | 8.87 | 8.83 | 61+ | Other | 987549 | 87.25 | 2.45 | 7.87 | Vaccination | 2382 | No | 95.62 | 1509 | 3.66 | 41121 | 0.82 | 68.78 |
| Saudi Arabia | 2005 | Alzheimer’s Disease | Genetic | 18.55 | 11.63 | 7.36 | 19-35 | Other | 989098 | 81.79 | 2.71 | 6.32 | Surgery | 38087 | Yes | 81.21 | 420 | 6.04 | 67373 | 0.74 | 32.60 |
| USA | 2012 | Alzheimer’s Disease | Parasitic | 15.02 | 0.81 | 6.39 | 0-18 | Male | 978985 | 82.86 | 3.58 | 2.62 | Therapy | 5799 | Yes | 71.14 | 1636 | 9.24 | 14485 | 0.63 | 33.76 |
| Brazil | 2014 | Alzheimer’s Disease | Bacterial | 9.72 | 0.82 | 4.23 | 19-35 | Other | 999386 | 99.29 | 3.65 | 1.96 | Vaccination | 32450 | No | 72.86 | 2464 | 4.40 | 18578 | 0.51 | 57.38 |
| Indonesia | 2011 | Alzheimer’s Disease | Respiratory | 14.59 | 1.88 | 1.37 | 0-18 | Male | 995566 | 63.94 | 3.33 | 5.84 | Surgery | 17167 | Yes | 70.50 | 1986 | 6.51 | 27794 | 0.89 | 73.16 |
Top-N per group (practical)
Top 3 countries per year by affected population.
WITH ranked AS (
SELECT
country,
year_event,
population_affected,
ROW_NUMBER() OVER (
PARTITION BY year_event
ORDER BY population_affected DESC, country ASC
) AS rn
FROM public.global_health_statistics
WHERE population_affected IS NOT NULL
)
SELECT country, year_event, population_affected
FROM ranked
WHERE rn <= 3
ORDER BY year_event, rn, country;
| country | year_event | population_affected |
|---|---|---|
| Australia | 2000 | 999987 |
| Brazil | 2000 | 999982 |
| Canada | 2000 | 999970 |
| China | 2001 | 999989 |
| Italy | 2001 | 999986 |
| Russia | 2001 | 999979 |
| UK | 2002 | 999999 |
| France | 2002 | 999983 |
| China | 2002 | 999963 |
| Japan | 2003 | 999999 |
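ROW_NUMBER() breaks ties arbitrarily (here the tie-breaker is country ASC). If tied populations should share a rank, and possibly return more than three rows per year, DENSE_RANK() is the usual substitute. A sketch:

```sql
-- Top-N per group with ties preserved: equal populations share a rank
WITH ranked AS (
  SELECT
    country,
    year_event,
    population_affected,
    DENSE_RANK() OVER (
      PARTITION BY year_event
      ORDER BY population_affected DESC
    ) AS rnk
  FROM public.global_health_statistics
  WHERE population_affected IS NOT NULL
)
SELECT country, year_event, population_affected
FROM ranked
WHERE rnk <= 3
ORDER BY year_event, rnk, country;
```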
3. Creating Variables and Conditional Logic
SQL lets us recode or transform values inline.
Using CASE
CASE is the SQL equivalent of ifelse() in R or if-else in Python, and it’s essential for creating categorical groupings.
-- Create a binary indicator for high mortality diseases
SELECT
disease_name,
mortality_pct,
CASE
WHEN mortality_pct > 20 THEN 'High Mortality'
ELSE 'Low/Moderate Mortality'
END AS mortality_group
FROM public.global_health_statistics
LIMIT 10;
| disease_name | mortality_pct | mortality_group |
|---|---|---|
| Cancer | 4.58 | Low/Moderate Mortality |
| Hepatitis | 3.00 | Low/Moderate Mortality |
| Tuberculosis | 7.15 | Low/Moderate Mortality |
| HIV/AIDS | 6.04 | Low/Moderate Mortality |
| Hypertension | 3.42 | Low/Moderate Mortality |
| HIV/AIDS | 1.83 | Low/Moderate Mortality |
| Parkinson’s Disease | 1.04 | Low/Moderate Mortality |
| Diabetes | 7.10 | Low/Moderate Mortality |
| Cancer | 0.55 | Low/Moderate Mortality |
| Cholera | 6.68 | Low/Moderate Mortality |
Multi-Category Recoding
Suppose we want to categorize diseases into Low, Moderate, and High mortality risk groups based on the mortality_pct variable, using a CASE statement with multiple WHEN ... THEN ... clauses.
-- Multi-category classification of mortality levels
SELECT
disease_name,
mortality_pct,
CASE
WHEN mortality_pct < 5 THEN 'Low'
WHEN mortality_pct BETWEEN 5 AND 20 THEN 'Moderate'
WHEN mortality_pct > 20 THEN 'High'
ELSE 'Unknown'
END AS mortality_category
FROM public.global_health_statistics
LIMIT 10;
| disease_name | mortality_pct | mortality_category |
|---|---|---|
| Zika | 3.08 | Low |
| Cholera | 2.69 | Low |
| HIV/AIDS | 0.22 | Low |
| Asthma | 8.27 | Moderate |
| Alzheimer’s Disease | 3.80 | Low |
| COVID-19 | 0.89 | Low |
| Tuberculosis | 3.45 | Low |
| HIV/AIDS | 4.25 | Low |
| Leprosy | 4.55 | Low |
| Diabetes | 1.42 | Low |
-- Bucket affected population into labeled ranges
-- (note: underscore digit separators such as 1_000_000 require PostgreSQL 16+)
SELECT
country,
disease_name,
year_event,
population_affected,
CASE
WHEN population_affected IS NULL THEN 'unknown'
WHEN population_affected < 1_000 THEN 'very low'
WHEN population_affected < 10_000 THEN 'low'
WHEN population_affected < 100_000 THEN 'moderate'
WHEN population_affected < 1_000_000 THEN 'high'
ELSE 'very high'
END AS burden_bucket
FROM public.global_health_statistics
ORDER BY year_event DESC, population_affected DESC
LIMIT 10;
| country | disease_name | year_event | population_affected | burden_bucket |
|---|---|---|---|---|
| Argentina | Measles | 2024 | 999962 | high |
| South Korea | Polio | 2024 | 999921 | high |
| Turkey | Influenza | 2024 | 999910 | high |
| India | Dengue | 2024 | 999855 | high |
| India | Polio | 2024 | 999848 | high |
| South Korea | Asthma | 2024 | 999812 | high |
| Russia | Polio | 2024 | 999791 | high |
| South Africa | Cancer | 2024 | 999754 | high |
| Italy | Diabetes | 2024 | 999751 | high |
| South Korea | Tuberculosis | 2024 | 999711 | high |
Combining Multiple Variables
We can even combine conditions from more than one variable. Here, we create a category based on both disease type and age group.
-- Conditional grouping using disease and age group
SELECT
disease_name,
age_group,
CASE
WHEN disease_category = 'Infectious' AND age_group = '0-14' THEN 'Child Infectious'
WHEN disease_category = 'Infectious' AND age_group = '65+' THEN 'Elderly Infectious'
WHEN disease_category = 'Non-communicable' AND age_group = '15-64' THEN 'Adult NCD'
ELSE 'Other'
END AS combined_category
FROM public.global_health_statistics
LIMIT 10;
| disease_name | age_group | combined_category |
|---|---|---|
| Zika | 36-60 | Other |
| Malaria | 36-60 | Other |
| Measles | 0-18 | Other |
| Leprosy | 36-60 | Other |
| Malaria | 0-18 | Other |
| Influenza | 61+ | Other |
| Diabetes | 36-60 | Other |
| Polio | 0-18 | Other |
| Zika | 36-60 | Other |
| Leprosy | 61+ | Other |
Handling Missing Values (COALESCE, NULLIF)
COALESCE replaces missing values with a default. NULLIF prevents division by zero by returning NULL instead.
-- Replace null healthcare access values with 'Unknown'
SELECT
country,
disease_name,
year_event,
COALESCE(healthcare_access::text, 'Unknown') AS healthcare_access
FROM public.global_health_statistics
LIMIT 10;
| country | disease_name | year_event | healthcare_access |
|---|---|---|---|
| Australia | Polio | 2020 | 60.15 |
| Italy | Zika | 2004 | 86.37 |
| Germany | Leprosy | 2021 | 82.38 |
| Nigeria | Dengue | 2005 | 72.18 |
| Germany | Influenza | 2012 | 64.23 |
| Argentina | Rabies | 2014 | 68.28 |
| UK | Zika | 2007 | 70.82 |
| South Korea | Alzheimer’s Disease | 2003 | 78.57 |
| Germany | Zika | 2016 | 97.23 |
| Mexico | Malaria | 2020 | 84.61 |
-- Calculate incidence rate safely (avoid division by zero)
SELECT
country,
disease_name,
incidence / NULLIF(population_affected, 0) AS incidence_rate
FROM public.global_health_statistics
WHERE year_event = 2020
LIMIT 10;
| country | disease_name | incidence_rate |
|---|---|---|
| Italy | Ebola | 9.7e-06 |
| Turkey | COVID-19 | 2.9e-06 |
| Italy | Leprosy | 1.3e-06 |
| South Korea | Leprosy | 4.6e-06 |
| Argentina | Hepatitis | 6.9e-06 |
| Mexico | Measles | 7.3e-06 |
| Mexico | COVID-19 | 8.0e-06 |
| France | Leprosy | 3.9e-06 |
| South Korea | Ebola | 1.2e-06 |
| Turkey | Measles | 3.0e-07 |
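The guard is easier to see on inline data than on a million-row table. In this sketch the second row has a zero denominator; NULLIF converts it to NULL so the division yields NULL instead of an error, and COALESCE supplies a fallback:

```sql
SELECT n, d,
       n / NULLIF(d, 0)              AS safe_ratio,    -- NULL when d = 0
       COALESCE(n / NULLIF(d, 0), 0) AS ratio_or_zero  -- fallback value
FROM (VALUES (10.0, 2.0), (10.0, 0.0)) AS t(n, d);
```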
4. Joins (INNER / LEFT / FULL)
To keep this self-contained, I’ll create two tables to demonstrate common join shapes:
country_region maps countries to regions.
disease_group maps diseases to broader categories.
In a real project these would live in dim.* tables with clear stewardship.
CREATE TEMP TABLE country_region (
country TEXT PRIMARY KEY,
region TEXT NOT NULL
);
INSERT INTO country_region(country, region) VALUES
('Australia','Oceania'),
('United States','North America'),
('Brazil','South America'),
('India','Asia'),
('Kenya','Africa');
CREATE TEMP TABLE disease_group (
disease_name TEXT PRIMARY KEY,
disease_group TEXT NOT NULL
);
INSERT INTO disease_group(disease_name, disease_group) VALUES
('Influenza A','Respiratory'),
('Influenza B','Respiratory'),
('COVID-19','Respiratory'),
('Dengue','Vector-borne'),
('Malaria','Vector-borne');
LEFT JOIN (keep all facts; enrich where possible)
Prefer this when facts must be retained even if the dimension is incomplete.
In the example below, LEFT JOIN ensures all health statistics are retained, even if metadata is missing.
SELECT
g.country,
COALESCE(cr.region, 'Unknown') AS region,
g.disease_name,
g.year_event,
g.population_affected
FROM public.global_health_statistics AS g
LEFT JOIN country_region AS cr
ON cr.country = g.country
WHERE g.year_event BETWEEN 2018 AND 2022
ORDER BY g.year_event DESC, region, g.population_affected DESC
LIMIT 15;
| country | region | disease_name | year_event | population_affected |
|---|---|---|---|---|
| India | Asia | Influenza | 2022 | 998735 |
| India | Asia | COVID-19 | 2022 | 998728 |
| India | Asia | Cholera | 2022 | 998331 |
| India | Asia | Polio | 2022 | 998188 |
| India | Asia | Measles | 2022 | 998057 |
| India | Asia | Cancer | 2022 | 996565 |
| India | Asia | Hypertension | 2022 | 994593 |
| India | Asia | Diabetes | 2022 | 994007 |
| India | Asia | Alzheimer’s Disease | 2022 | 993771 |
| India | Asia | Zika | 2022 | 993585 |
FULL JOIN: keep everything from both sides
Useful for reconciliation and data quality investigations.
SELECT
COALESCE(g.country, cr.country) AS country_key,
cr.region,
g.disease_name,
g.year_event,
g.population_affected,
CASE
WHEN g.country IS NULL THEN 'missing_in_facts'
WHEN cr.country IS NULL THEN 'missing_in_dim'
ELSE 'matched'
END AS join_status
FROM public.global_health_statistics AS g
FULL OUTER JOIN country_region AS cr
ON cr.country = g.country
ORDER BY join_status, country_key
LIMIT 10;
| country_key | region | disease_name | year_event | population_affected | join_status |
|---|---|---|---|---|---|
| Australia | Oceania | Hypertension | 2004 | 995949 | matched |
| Australia | Oceania | Hepatitis | 2009 | 996791 | matched |
| Australia | Oceania | COVID-19 | 2000 | 995669 | matched |
| Australia | Oceania | Measles | 2021 | 977797 | matched |
| Australia | Oceania | Asthma | 2010 | 982446 | matched |
| Australia | Oceania | Malaria | 2012 | 993091 | matched |
| Australia | Oceania | Cholera | 2009 | 993428 | matched |
| Australia | Oceania | Asthma | 2015 | 966733 | matched |
| Australia | Oceania | Asthma | 2013 | 998228 | matched |
| Australia | Oceania | Leprosy | 2011 | 990548 | matched |
Joining multiple dimensions
Add both region and a disease_group, then compute region-level totals.
SELECT
cr.region,
dg.disease_group,
g.year_event,
SUM(g.population_affected) AS population_total
FROM public.global_health_statistics AS g
LEFT JOIN country_region AS cr
ON cr.country = g.country
LEFT JOIN disease_group AS dg
ON dg.disease_name = g.disease_name
WHERE g.year_event >= 2015
GROUP BY cr.region, dg.disease_group, g.year_event
ORDER BY cr.region NULLS LAST, dg.disease_group NULLS LAST, g.year_event;
| region | disease_group | year_event | population_total |
|---|---|---|---|
| Asia | Respiratory | 2015 | 996752 |
| Asia | Respiratory | 2016 | 988454 |
| Asia | Respiratory | 2017 | 999807 |
| Asia | Respiratory | 2018 | 997011 |
| Asia | Respiratory | 2019 | 991913 |
| Asia | Respiratory | 2020 | 997987 |
| Asia | Respiratory | 2021 | 998308 |
| Asia | Respiratory | 2022 | 998728 |
| Asia | Respiratory | 2023 | 980264 |
| Asia | Respiratory | 2024 | 994785 |
Anti-join pattern (find orphans)
Facts with unmapped countries (dimension coverage gap):
SELECT g.*
FROM public.global_health_statistics AS g
WHERE NOT EXISTS (
SELECT 1
FROM country_region AS cr
WHERE cr.country = g.country
)
ORDER BY g.country, g.disease_name, g.year_event
LIMIT 10;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 2000 | Alzheimer’s Disease | Chronic | 11.67 | 2.09 | 0.99 | 19-35 | Male | 999739 | 56.45 | 2.62 | 0.94 | Vaccination | 7431 | Yes | 93.08 | 2852 | 2.11 | 94376 | 0.45 | 89.55 |
| Argentina | 2001 | Alzheimer’s Disease | Genetic | 7.53 | 1.16 | 1.60 | 61+ | Female | 995329 | 55.56 | 2.92 | 9.01 | Therapy | 22389 | No | 89.26 | 2828 | 9.99 | 32326 | 0.71 | 51.91 |
| Argentina | 2002 | Alzheimer’s Disease | Bacterial | 19.84 | 2.36 | 7.98 | 36-60 | Female | 999312 | 80.47 | 1.66 | 7.84 | Surgery | 22607 | Yes | 52.38 | 4166 | 2.33 | 78110 | 0.82 | 71.93 |
| Argentina | 2003 | Alzheimer’s Disease | Respiratory | 2.71 | 1.16 | 6.98 | 61+ | Female | 993597 | 69.29 | 4.80 | 1.09 | Therapy | 22557 | No | 90.07 | 3888 | 8.11 | 92623 | 0.83 | 62.30 |
| Argentina | 2004 | Alzheimer’s Disease | Viral | 3.34 | 14.08 | 2.02 | 0-18 | Male | 991236 | 79.44 | 1.33 | 6.34 | Surgery | 18506 | Yes | 66.13 | 2646 | 5.26 | 70858 | 0.54 | 84.19 |
| Argentina | 2005 | Alzheimer’s Disease | Genetic | 7.95 | 2.05 | 6.92 | 61+ | Male | 994878 | 55.81 | 2.91 | 3.04 | Therapy | 8444 | Yes | 61.27 | 1902 | 2.97 | 70939 | 0.55 | 68.29 |
| Argentina | 2006 | Alzheimer’s Disease | Cardiovascular | 4.59 | 5.16 | 2.99 | 61+ | Other | 999475 | 63.33 | 4.97 | 3.40 | Medication | 14464 | Yes | 98.18 | 156 | 3.38 | 92196 | 0.55 | 33.07 |
| Argentina | 2007 | Alzheimer’s Disease | Neurological | 16.36 | 1.39 | 3.74 | 61+ | Other | 968726 | 90.92 | 1.87 | 2.74 | Medication | 10705 | No | 59.49 | 1924 | 4.10 | 40606 | 0.78 | 69.94 |
| Argentina | 2008 | Alzheimer’s Disease | Neurological | 9.55 | 4.24 | 5.27 | 0-18 | Female | 982297 | 68.95 | 2.43 | 1.69 | Surgery | 35053 | Yes | 82.13 | 3169 | 9.67 | 91423 | 0.71 | 78.84 |
| Argentina | 2009 | Alzheimer’s Disease | Autoimmune | 11.05 | 3.43 | 7.97 | 36-60 | Other | 984953 | 57.13 | 4.31 | 5.23 | Medication | 28014 | Yes | 92.87 | 759 | 2.42 | 95653 | 0.75 | 51.54 |
- Anti-join via LEFT JOIN … IS NULL (equivalent)
SELECT g.*
FROM public.global_health_statistics AS g
LEFT JOIN country_region AS cr
ON cr.country = g.country
WHERE cr.country IS NULL
ORDER BY g.country, g.disease_name, g.year_event
LIMIT 20;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 2000 | Alzheimer’s Disease | Chronic | 11.67 | 2.09 | 0.99 | 19-35 | Male | 999739 | 56.45 | 2.62 | 0.94 | Vaccination | 7431 | Yes | 93.08 | 2852 | 2.11 | 94376 | 0.45 | 89.55 |
| Argentina | 2001 | Alzheimer’s Disease | Genetic | 7.53 | 1.16 | 1.60 | 61+ | Female | 995329 | 55.56 | 2.92 | 9.01 | Therapy | 22389 | No | 89.26 | 2828 | 9.99 | 32326 | 0.71 | 51.91 |
| Argentina | 2002 | Alzheimer’s Disease | Bacterial | 19.84 | 2.36 | 7.98 | 36-60 | Female | 999312 | 80.47 | 1.66 | 7.84 | Surgery | 22607 | Yes | 52.38 | 4166 | 2.33 | 78110 | 0.82 | 71.93 |
| Argentina | 2003 | Alzheimer’s Disease | Respiratory | 2.71 | 1.16 | 6.98 | 61+ | Female | 993597 | 69.29 | 4.80 | 1.09 | Therapy | 22557 | No | 90.07 | 3888 | 8.11 | 92623 | 0.83 | 62.30 |
| Argentina | 2004 | Alzheimer’s Disease | Viral | 3.34 | 14.08 | 2.02 | 0-18 | Male | 991236 | 79.44 | 1.33 | 6.34 | Surgery | 18506 | Yes | 66.13 | 2646 | 5.26 | 70858 | 0.54 | 84.19 |
| Argentina | 2005 | Alzheimer’s Disease | Genetic | 7.95 | 2.05 | 6.92 | 61+ | Male | 994878 | 55.81 | 2.91 | 3.04 | Therapy | 8444 | Yes | 61.27 | 1902 | 2.97 | 70939 | 0.55 | 68.29 |
| Argentina | 2006 | Alzheimer’s Disease | Cardiovascular | 4.59 | 5.16 | 2.99 | 61+ | Other | 999475 | 63.33 | 4.97 | 3.40 | Medication | 14464 | Yes | 98.18 | 156 | 3.38 | 92196 | 0.55 | 33.07 |
| Argentina | 2007 | Alzheimer’s Disease | Neurological | 16.36 | 1.39 | 3.74 | 61+ | Other | 968726 | 90.92 | 1.87 | 2.74 | Medication | 10705 | No | 59.49 | 1924 | 4.10 | 40606 | 0.78 | 69.94 |
| Argentina | 2008 | Alzheimer’s Disease | Neurological | 9.55 | 4.24 | 5.27 | 0-18 | Female | 982297 | 68.95 | 2.43 | 1.69 | Surgery | 35053 | Yes | 82.13 | 3169 | 9.67 | 91423 | 0.71 | 78.84 |
| Argentina | 2009 | Alzheimer’s Disease | Autoimmune | 11.05 | 3.43 | 7.97 | 36-60 | Other | 984953 | 57.13 | 4.31 | 5.23 | Medication | 28014 | Yes | 92.87 | 759 | 2.42 | 95653 | 0.75 | 51.54 |
Semi-Join Pattern (Keep Only Facts That Have a Match)
Goal: return fact rows that have at least one matching dimension row, without returning dimension columns or duplicating facts.
Canonical SQL: WHERE EXISTS (...) (preferred) or IN (subquery).
Unlike inner joins, a semi-join returns each qualifying fact at most once.
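The "at most once" guarantee is what distinguishes a semi-join from an inner join when the dimension key is duplicated. A minimal in-memory sketch (using SQLite via Python's sqlite3 rather than PostgreSQL; the table names and values are made up for illustration):

```python
import sqlite3

# Toy data: one fact row, but the dimension key appears twice
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE facts(country TEXT, cases INT);
INSERT INTO facts VALUES ('India', 10);
CREATE TABLE dim(country TEXT, region TEXT);
INSERT INTO dim VALUES ('India', 'Asia'), ('India', 'Asia-dup');
""")

inner = con.execute(
    "SELECT f.country FROM facts f JOIN dim d ON d.country = f.country"
).fetchall()
semi = con.execute(
    "SELECT f.country FROM facts f "
    "WHERE EXISTS (SELECT 1 FROM dim d WHERE d.country = f.country)"
).fetchall()

print(len(inner))  # 2: the inner join duplicates the fact row
print(len(semi))   # 1: the semi-join returns it at most once
```

The same contrast holds in PostgreSQL: an inner join against a dimension with duplicate keys inflates fact counts, while EXISTS never does.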
- Semi-join via EXISTS ✅ (recommended)
SELECT g.country, g.disease_name, g.year_event, g.population_affected
FROM public.global_health_statistics AS g
WHERE EXISTS (
SELECT 1
FROM country_region AS cr
WHERE cr.country = g.country
)
ORDER BY g.country, g.disease_name, g.year_event
LIMIT 10;
| country | disease_name | year_event | population_affected |
|---|---|---|---|
| Australia | Alzheimer’s Disease | 2000 | 999987 |
| Australia | Alzheimer’s Disease | 2001 | 999380 |
| Australia | Alzheimer’s Disease | 2002 | 989422 |
| Australia | Alzheimer’s Disease | 2003 | 995752 |
| Australia | Alzheimer’s Disease | 2004 | 993018 |
| Australia | Alzheimer’s Disease | 2005 | 994235 |
| Australia | Alzheimer’s Disease | 2006 | 999998 |
| Australia | Alzheimer’s Disease | 2007 | 985789 |
| Australia | Alzheimer’s Disease | 2008 | 994427 |
| Australia | Alzheimer’s Disease | 2009 | 981466 |
3. Appending Data (Stacking Rows)
Appending (row-binding) is done using UNION or UNION ALL.
UNION ALL keeps duplicates.
UNION removes duplicates.
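The duplicate-handling difference is easy to verify on a toy table (a sketch using SQLite via Python's sqlite3; the table and values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x INT)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,)])

# Stack the table on top of itself both ways
union_all = con.execute("SELECT x FROM t UNION ALL SELECT x FROM t").fetchall()
union     = con.execute("SELECT x FROM t UNION SELECT x FROM t").fetchall()

print(len(union_all))  # 6: every row from both sides is kept
print(len(union))      # 2: only the distinct values 1 and 2 survive
```

Because UNION must deduplicate, it is also slower; prefer UNION ALL when the two inputs cannot overlap (as in the year-split example below).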
-- Append two subsets of the same table (stack rows)
SELECT country, year_event, disease_name, population_affected
FROM public.global_health_statistics
WHERE year_event <= 2015
UNION ALL
SELECT country, year_event, disease_name, population_affected
FROM public.global_health_statistics
WHERE year_event > 2015
ORDER BY year_event
LIMIT 20;
| country | year_event | disease_name | population_affected |
|---|---|---|---|
| India | 2000 | Hypertension | 989212 |
| Australia | 2000 | Rabies | 999093 |
| France | 2000 | COVID-19 | 996188 |
| Canada | 2000 | HIV/AIDS | 997838 |
| Italy | 2000 | Tuberculosis | 982734 |
| Russia | 2000 | Ebola | 994461 |
| Brazil | 2000 | Ebola | 998677 |
| Russia | 2000 | Cancer | 994540 |
| Australia | 2000 | HIV/AIDS | 998771 |
| Saudi Arabia | 2000 | Dengue | 981698 |
Part V — Advanced Wrangling
Beyond basic SELECT and JOIN operations, SQL offers powerful tools for advanced wrangling:
- Common Table Expressions (CTEs) for readability and modular pipelines
- Window functions for ranking, partitioning, and moving calculations
- Deduplication using ROW_NUMBER()
- Upserts for managing incremental updates
These techniques are essential for real-world analytics pipelines.
1. Common Table Expressions (CTEs)
A CTE lets us build queries in readable, stepwise blocks. It’s the SQL equivalent of creating intermediate objects in R with the pipe %>%.
In the example below, the first block (mortality_groups) is like an intermediate dataset. The second query reuses it for further filtering.
-- CTE to calculate mortality categories, then filter
WITH mortality_groups AS (
SELECT
country,
disease_name,
year_event,
mortality_pct,
CASE
WHEN mortality_pct > 9 THEN 'High Mortality'
ELSE 'Low/Moderate Mortality'
END AS mortality_group
FROM public.global_health_statistics
)
SELECT *
FROM mortality_groups
WHERE mortality_group = 'High Mortality'
LIMIT 10;
| country | disease_name | year_event | mortality_pct | mortality_group |
|---|---|---|---|---|
| Nigeria | Malaria | 2002 | 9.97 | High Mortality |
| South Korea | Hepatitis | 2024 | 9.48 | High Mortality |
| Argentina | HIV/AIDS | 2007 | 9.65 | High Mortality |
| Brazil | Polio | 2005 | 9.60 | High Mortality |
| Australia | Malaria | 2012 | 9.50 | High Mortality |
| India | Influenza | 2002 | 9.22 | High Mortality |
| Canada | Polio | 2012 | 9.91 | High Mortality |
| Italy | Malaria | 2020 | 9.24 | High Mortality |
| Russia | Malaria | 2008 | 9.72 | High Mortality |
| Nigeria | Asthma | 2013 | 9.15 | High Mortality |
2. Window Functions
Window functions allow calculations across partitions of data without collapsing rows. They are powerful for ranking, moving averages, and comparisons.
Ranking within Groups
-- Rank diseases by affected population per country in 2020
SELECT
country,
disease_name,
population_affected,
RANK() OVER (PARTITION BY country ORDER BY population_affected DESC) AS rank_within_country
FROM public.global_health_statistics
WHERE year_event = 2020
ORDER BY country, rank_within_country
LIMIT 10;
| country | disease_name | population_affected | rank_within_country |
|---|---|---|---|
| Argentina | Cancer | 999522 | 1 |
| Argentina | Dengue | 997702 | 2 |
| Argentina | COVID-19 | 997478 | 3 |
| Argentina | Hypertension | 997396 | 4 |
| Argentina | Alzheimer’s Disease | 995954 | 5 |
| Argentina | Tuberculosis | 995534 | 6 |
| Argentina | Ebola | 995350 | 7 |
| Argentina | HIV/AIDS | 992933 | 8 |
| Argentina | Diabetes | 992740 | 9 |
| Argentina | Zika | 992414 | 10 |
Calculating a rolling 3-row average for selected diseases. Note that the frame is row-based (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW): because this dataset has several rows per disease and year (one per country), the window averages the previous two rows, not the previous two calendar years.
-- 3-year moving average of incidence per disease
SELECT
disease_name,
year_event,
incidence,
ROUND(
(AVG(incidence) OVER (
PARTITION BY disease_name
ORDER BY year_event
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
))::numeric, 2
) AS incidence_moving_avg
FROM public.global_health_statistics
WHERE disease_name IN ('Malaria','Tuberculosis')
ORDER BY disease_name, year_event
LIMIT 15;
| disease_name | year_event | incidence | incidence_moving_avg |
|---|---|---|---|
| Malaria | 2000 | 14.50 | 14.50 |
| Malaria | 2000 | 12.62 | 13.56 |
| Malaria | 2000 | 8.22 | 11.78 |
| Malaria | 2000 | 5.83 | 8.89 |
| Malaria | 2000 | 3.87 | 5.97 |
| Malaria | 2000 | 12.26 | 7.32 |
| Malaria | 2000 | 6.30 | 7.48 |
| Malaria | 2000 | 4.53 | 7.70 |
| Malaria | 2000 | 4.33 | 5.05 |
| Malaria | 2000 | 1.99 | 3.62 |
3. Deduplication with ROW_NUMBER()
Sometimes datasets contain duplicate records. We can use ROW_NUMBER() to keep the “first” occurrence and drop the rest.
Identify duplicates across all columns
The example below will show you which rows are duplicated, and how many times.
For example, if the same country + disease + year + population row appears 3 times, you’ll see duplicate_count = 3.
SELECT
country,
disease_name,
year_event,
population_affected,
COUNT(*) AS duplicate_count
FROM public.global_health_statistics
GROUP BY
country,
disease_name,
year_event,
population_affected
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC
LIMIT 10;
| country | disease_name | year_event | population_affected | duplicate_count |
|---|---|---|---|---|
If you want to remove duplicates, keeping only one record per case
-- This removes duplicates across all columns.
SELECT DISTINCT *
FROM public.global_health_statistics
LIMIT 15;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 2000 | Alzheimer’s Disease | Chronic | 11.67 | 2.09 | 0.99 | 19-35 | Male | 999739 | 56.45 | 2.62 | 0.94 | Vaccination | 7431 | Yes | 93.08 | 2852 | 2.11 | 94376 | 0.45 | 89.55 |
| Argentina | 2000 | Asthma | Parasitic | 6.03 | 11.82 | 4.71 | 36-60 | Male | 999306 | 65.53 | 3.85 | 2.67 | Vaccination | 34713 | Yes | 87.90 | 2810 | 0.42 | 83780 | 0.76 | 22.01 |
| Argentina | 2000 | Cancer | Bacterial | 19.72 | 7.65 | 4.07 | 61+ | Other | 980822 | 57.52 | 3.96 | 5.92 | Medication | 36545 | Yes | 56.26 | 726 | 1.43 | 86501 | 0.61 | 70.75 |
| Argentina | 2000 | Cholera | Infectious | 19.31 | 13.36 | 6.39 | 36-60 | Other | 991759 | 64.20 | 0.87 | 7.23 | Therapy | 5415 | No | 51.64 | 4251 | 7.54 | 95337 | 0.81 | 83.87 |
| Argentina | 2000 | COVID-19 | Genetic | 5.87 | 4.43 | 3.91 | 36-60 | Female | 996726 | 74.90 | 4.33 | 0.79 | Surgery | 48595 | No | 84.81 | 494 | 0.13 | 64551 | 0.41 | 46.80 |
| Argentina | 2000 | Dengue | Chronic | 6.42 | 13.52 | 4.60 | 61+ | Female | 997142 | 69.02 | 1.13 | 4.69 | Therapy | 15127 | Yes | 53.30 | 4906 | 5.08 | 98961 | 0.51 | 40.18 |
| Argentina | 2000 | Diabetes | Viral | 9.74 | 4.52 | 1.08 | 0-18 | Other | 996077 | 60.79 | 3.10 | 2.41 | Medication | 43795 | No | 56.47 | 4572 | 4.81 | 75035 | 0.80 | 68.70 |
| Argentina | 2000 | Ebola | Neurological | 11.57 | 2.21 | 7.15 | 61+ | Female | 996462 | 68.50 | 1.44 | 9.78 | Medication | 33639 | No | 98.32 | 4943 | 8.77 | 21021 | 0.63 | 74.10 |
| Argentina | 2000 | Hepatitis | Viral | 6.85 | 6.60 | 4.23 | 61+ | Other | 975406 | 83.04 | 3.48 | 3.67 | Vaccination | 17061 | No | 73.33 | 1911 | 3.90 | 37765 | 0.71 | 48.95 |
| Argentina | 2000 | HIV/AIDS | Autoimmune | 7.45 | 4.96 | 0.69 | 36-60 | Female | 996694 | 76.82 | 3.45 | 3.63 | Medication | 34932 | Yes | 78.58 | 3455 | 6.66 | 40875 | 0.77 | 89.73 |
If you only want to de-duplicate on some columns (e.g., country, disease_name, year_event), then use DISTINCT ON (Postgres-only):
SELECT DISTINCT ON (country, disease_name, year_event) *
FROM public.global_health_statistics
ORDER BY country, disease_name, year_event, population_affected DESC
LIMIT 15;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 2000 | Alzheimer’s Disease | Chronic | 11.67 | 2.09 | 0.99 | 19-35 | Male | 999739 | 56.45 | 2.62 | 0.94 | Vaccination | 7431 | Yes | 93.08 | 2852 | 2.11 | 94376 | 0.45 | 89.55 |
| Argentina | 2001 | Alzheimer’s Disease | Genetic | 7.53 | 1.16 | 1.60 | 61+ | Female | 995329 | 55.56 | 2.92 | 9.01 | Therapy | 22389 | No | 89.26 | 2828 | 9.99 | 32326 | 0.71 | 51.91 |
| Argentina | 2002 | Alzheimer’s Disease | Bacterial | 19.84 | 2.36 | 7.98 | 36-60 | Female | 999312 | 80.47 | 1.66 | 7.84 | Surgery | 22607 | Yes | 52.38 | 4166 | 2.33 | 78110 | 0.82 | 71.93 |
| Argentina | 2003 | Alzheimer’s Disease | Respiratory | 2.71 | 1.16 | 6.98 | 61+ | Female | 993597 | 69.29 | 4.80 | 1.09 | Therapy | 22557 | No | 90.07 | 3888 | 8.11 | 92623 | 0.83 | 62.30 |
| Argentina | 2004 | Alzheimer’s Disease | Viral | 3.34 | 14.08 | 2.02 | 0-18 | Male | 991236 | 79.44 | 1.33 | 6.34 | Surgery | 18506 | Yes | 66.13 | 2646 | 5.26 | 70858 | 0.54 | 84.19 |
| Argentina | 2005 | Alzheimer’s Disease | Genetic | 7.95 | 2.05 | 6.92 | 61+ | Male | 994878 | 55.81 | 2.91 | 3.04 | Therapy | 8444 | Yes | 61.27 | 1902 | 2.97 | 70939 | 0.55 | 68.29 |
| Argentina | 2006 | Alzheimer’s Disease | Cardiovascular | 4.59 | 5.16 | 2.99 | 61+ | Other | 999475 | 63.33 | 4.97 | 3.40 | Medication | 14464 | Yes | 98.18 | 156 | 3.38 | 92196 | 0.55 | 33.07 |
| Argentina | 2007 | Alzheimer’s Disease | Neurological | 16.36 | 1.39 | 3.74 | 61+ | Other | 968726 | 90.92 | 1.87 | 2.74 | Medication | 10705 | No | 59.49 | 1924 | 4.10 | 40606 | 0.78 | 69.94 |
| Argentina | 2008 | Alzheimer’s Disease | Neurological | 9.55 | 4.24 | 5.27 | 0-18 | Female | 982297 | 68.95 | 2.43 | 1.69 | Surgery | 35053 | Yes | 82.13 | 3169 | 9.67 | 91423 | 0.71 | 78.84 |
| Argentina | 2009 | Alzheimer’s Disease | Autoimmune | 11.05 | 3.43 | 7.97 | 36-60 | Other | 984953 | 57.13 | 4.31 | 5.23 | Medication | 28014 | Yes | 92.87 | 759 | 2.42 | 95653 | 0.75 | 51.54 |
Show the actual duplicate rows (not just counts). This returns the full duplicate rows, each with a duplicate_count column attached.
SELECT *
FROM (
SELECT *,
COUNT(*) OVER (
PARTITION BY country, disease_name, year_event, population_affected
) AS duplicate_count
FROM public.global_health_statistics
) t
WHERE duplicate_count > 1
ORDER BY country, disease_name, year_event
LIMIT 10;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct | duplicate_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-- Keep only the latest record per country/disease
WITH ranked AS (
SELECT
country,
disease_name,
year_event,
population_affected,
ROW_NUMBER() OVER (
PARTITION BY country, disease_name
ORDER BY year_event DESC
) AS rn
FROM public.global_health_statistics
)
SELECT *
FROM ranked
WHERE rn = 1
ORDER BY country, disease_name
LIMIT 20;
| country | disease_name | year_event | population_affected | rn |
|---|---|---|---|---|
| Argentina | Alzheimer’s Disease | 2024 | 994386 | 1 |
| Argentina | Asthma | 2024 | 997195 | 1 |
| Argentina | Cancer | 2024 | 993535 | 1 |
| Argentina | Cholera | 2024 | 999242 | 1 |
| Argentina | COVID-19 | 2024 | 988771 | 1 |
| Argentina | Dengue | 2024 | 988472 | 1 |
| Argentina | Diabetes | 2024 | 987174 | 1 |
| Argentina | Ebola | 2024 | 995657 | 1 |
| Argentina | Hepatitis | 2024 | 990629 | 1 |
| Argentina | HIV/AIDS | 2024 | 989442 | 1 |
4. Upserts Overview
In real-world pipelines, we often receive incremental updates (new years, new countries, or corrected values). An upsert (update + insert) ensures the table stays in sync:
If a record exists → update it
If not → insert it
The ON CONFLICT clause ensures data is not duplicated. Instead, existing rows are updated.
Tip: the target table needs a unique constraint (or primary key) on the conflict columns for ON CONFLICT to apply.
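The examples below use the DO NOTHING branch; the "if a record exists, update it" branch uses ON CONFLICT ... DO UPDATE with the excluded pseudo-table. A minimal sketch (run here against SQLite via Python's sqlite3, which shares this upsert syntax with PostgreSQL; the table and values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE ghs (
    country TEXT, disease_name TEXT, year_event INT,
    population_affected INT,
    UNIQUE (country, disease_name, year_event)
)""")
con.execute("INSERT INTO ghs VALUES ('Australia', 'Influenza', 2022, 100)")

# Second insert hits the unique key, so the existing row is updated instead
con.execute("""
INSERT INTO ghs VALUES ('Australia', 'Influenza', 2022, 250)
ON CONFLICT (country, disease_name, year_event)
DO UPDATE SET population_affected = excluded.population_affected
""")

print(con.execute("SELECT COUNT(*) FROM ghs").fetchone()[0])           # 1
print(con.execute("SELECT population_affected FROM ghs").fetchone()[0])  # 250
```

excluded refers to the row that would have been inserted, so incremental loads can overwrite stale values without ever duplicating the key.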
First, keep only one row per (country, disease_name, year_event), retaining the one with the largest population_affected:
DELETE FROM public.global_health_statistics a
USING (
SELECT ctid, ROW_NUMBER() OVER (
PARTITION BY country, disease_name, year_event
ORDER BY population_affected DESC
) AS rn
FROM public.global_health_statistics
) b
WHERE a.ctid = b.ctid AND b.rn > 1;
-- Creating the unique identifier
ALTER TABLE public.global_health_statistics
DROP CONSTRAINT IF EXISTS ghs_unique;
ALTER TABLE public.global_health_statistics
ADD CONSTRAINT ghs_unique UNIQUE (country, year_event, disease_name);
-- Insert or do nothing if the combination already exists
INSERT INTO public.global_health_statistics (
country, disease_name, year_event
)
VALUES (
'Australia', 'Influenza', 2022
)
ON CONFLICT (country, disease_name, year_event)
DO NOTHING;
Part VI — Data Quality, Types & Dates
🧭 Overview
High-quality analysis starts with high-quality data.
This section focuses on assessing and improving data integrity, handling missing values, managing data types, and performing date/time transformations.
In practice, data rarely arrives in perfect shape — missing records, inconsistent data types, and messy strings are common.
Being able to diagnose and correct these issues directly in SQL demonstrates the ability to build analysis-ready datasets that decision-makers can trust.
Missing data occurs for many reasons: incomplete records, data-entry errors, or missing identifiers.
Understanding, profiling, and deciding how to handle these missing values are essential steps before any statistical or predictive modeling.
The workflow includes:
- Profiling missing data
- Reviewing incomplete records
- Applying imputation (filling) strategies
- Forward and backward filling within patients
- Mean-based imputations
All SQL examples below are written for PostgreSQL but can be adapted easily for SQL Server or SQLite.
1️⃣ Importing and Preparing the Data
To begin, the dataset is imported into PostgreSQL as public.visit_hospital.
This process ensures that the structure matches the source CSV file (C:\temp\visit_hospital.csv).
-- Create table for hospital visit data
DROP TABLE IF EXISTS public.visit_hospital;
CREATE TABLE public.visit_hospital (
-- Identifiers
id TEXT, id_text TEXT,
-- Demographics
birthdate DATE, deathdate DATE, ssn TEXT, drivers TEXT, passport TEXT,
prefix TEXT, first TEXT, middle TEXT, last TEXT, suffix TEXT,
maiden TEXT, marital TEXT, race TEXT, ethnicity TEXT, gender TEXT,
-- Location
birthplace TEXT, address TEXT, city TEXT, state TEXT, county TEXT,
fips INT, zip INT, lat NUMERIC, lon NUMERIC,
-- Healthcare / Socioeconomic
healthcare_expenses NUMERIC, healthcare_coverage NUMERIC, income NUMERIC,
-- Visit metadata
visit INT
);
-- Stage the raw text date: load birthdate into a text column first,
-- then convert it into a proper DATE after the COPY below.
ALTER TABLE public.visit_hospital ADD COLUMN IF NOT EXISTS birthdate_raw TEXT;
-- COPY with:
-- - NULL 'NA' -> turn "NA" into NULL
-- - FORCE_NULL (...) -> also treat empty strings as NULL for listed columns
COPY public.visit_hospital (
id, birthdate_raw, deathdate, ssn, drivers, passport,
prefix, first, middle, last, suffix, maiden,
marital, race, ethnicity, gender,
birthplace, address, city, state, county,
fips, zip, lat, lon,
healthcare_expenses, healthcare_coverage, income,
visit, id_text
)
FROM 'C:/temp/visit_hospital.csv'
WITH (
FORMAT CSV,
HEADER TRUE,
NULL 'NA',
FORCE_NULL (birthdate_raw, deathdate, fips, zip, lat, lon,
healthcare_expenses, healthcare_coverage, income),
DELIMITER ',',
QUOTE '"',
ESCAPE '"',
ENCODING 'UTF8');
-- Clean the staged text into a proper DATE
UPDATE public.visit_hospital
SET birthdate = TO_DATE(birthdate_raw, 'DD/MM/YYYY')
WHERE birthdate_raw ~ '^\d{2}/\d{2}/\d{4}$';
2️⃣ Profiling Missing Data
Understanding which columns have missing values — and how many — is the foundation of any data-quality review.
Before fixing missing data, we must understand how much is missing and where. This first step mimics colSums(is.na()) in R — it counts missing values for each important column.
-- Summarize missing values per selected columns
SELECT
'birthdate' AS column_name, COUNT(*) FILTER (WHERE birthdate IS NULL) AS missing_count FROM public.visit_hospital
UNION ALL SELECT 'deathdate', COUNT(*) FILTER (WHERE deathdate IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'race', COUNT(*) FILTER (WHERE race IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'income', COUNT(*) FILTER (WHERE income IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'passport', COUNT(*) FILTER (WHERE passport IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'fips', COUNT(*) FILTER (WHERE fips IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'zip', COUNT(*) FILTER (WHERE zip IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'healthcare_coverage', COUNT(*) FILTER (WHERE healthcare_coverage IS NULL) FROM public.visit_hospital;
| column_name | missing_count |
|---|---|
| birthdate | 16 |
| deathdate | 40 |
| race | 20 |
| income | 20 |
| passport | 12 |
| fips | 12 |
| zip | 12 |
| healthcare_coverage | 4 |
Context: Each row in this summary represents a variable and the number of NULL (missing) entries. This gives you an at-a-glance understanding of where data quality issues exist, and whether they are systematic (entire columns) or random (sporadic records).
To get the total number of missing cells across the dataset:
-- Calculate total number of missing cells across selected columns
SELECT
SUM(
(birthdate IS NULL)::INT +
(deathdate IS NULL)::INT +
(race IS NULL)::INT +
(income IS NULL)::INT +
(passport IS NULL)::INT +
(fips IS NULL)::INT +
(zip IS NULL)::INT +
(healthcare_coverage IS NULL)::INT
) AS total_missing_cells
FROM public.visit_hospital;
| total_missing_cells |
|---|
| 136 |
This produces one number — the total missing count — useful for data-quality reporting or automated alerts in data pipelines.
3️⃣ Inspecting Rows with Missing Data
After identifying which columns contain missing values, it’s helpful to inspect a few affected records.
-- Display sample of records that contain missing values
SELECT id_text, visit, race, income, passport
FROM public.visit_hospital
WHERE race IS NULL OR income IS NULL OR passport IS NULL
ORDER BY id_text, visit
LIMIT 12;
| id_text | visit | race | income | passport |
|---|---|---|---|---|
| Anit|Sánc|1960-05-07 | 2 | NA | NA | X9157439X |
| Anit|Sánc|1960-05-07 | 3 | white | NA | X9157439X |
| Anit|Sánc|1960-05-07 | 4 | NA | 60016 | X9157439X |
| Aure|Weis|2003-11-12 | 2 | NA | NA | X17586296X |
| Aure|Weis|2003-11-12 | 3 | white | NA | X17586296X |
| Aure|Weis|2003-11-12 | 4 | NA | 7752 | X17586296X |
| Cars|Litt|2012-09-02 | 1 | white | 71169 | NA |
| Cars|Litt|2012-09-02 | 2 | NA | NA | NA |
| Cars|Litt|2012-09-02 | 3 | NA | 70169 | NA |
| Cars|Litt|2012-09-02 | 4 | white | NA | NA |
This query mimics filter(!complete.cases(.)) in R. It reveals the context of missing values — are they clustered by person, visit, or variable? This step helps determine the appropriate imputation method (e.g., individual-level fill vs. global averages).
4️⃣ Handling Missing Values (Imputation)
Imputation means replacing missing data with estimated or default values. This section demonstrates several techniques — from simple replacements to forward/backward fills.
4.1 Simple categorical imputation: Replacement with COALESCE()
-- Replace missing passport values with 'UNKNOWN'
SELECT
id_text,
visit,
passport,
COALESCE(passport, 'UNKNOWN') AS passport_imputed
FROM public.visit_hospital
WHERE passport IS NULL
LIMIT 12;
| id_text | visit | passport | passport_imputed |
|---|---|---|---|
| Cars|Litt|2012-09-02 | 1 | NA | UNKNOWN |
| Doll|Fran|2008-10-04 | 1 | NA | UNKNOWN |
| Cars|Litt|2012-09-02 | 2 | NA | UNKNOWN |
| Doll|Fran|2008-10-04 | 2 | NA | UNKNOWN |
| Cars|Litt|2012-09-02 | 3 | NA | UNKNOWN |
| Doll|Fran|2008-10-04 | 3 | NA | UNKNOWN |
| Cars|Litt|2012-09-02 | 4 | NA | UNKNOWN |
| Doll|Fran|2008-10-04 | 4 | NA | UNKNOWN |
| Jimm|Doyl|2005-04-25 | 1 | NA | UNKNOWN |
| Jimm|Doyl|2005-04-25 | 2 | NA | UNKNOWN |
COALESCE() replaces NULL with a specified value. For example, unknown passport numbers might represent unregistered patients — instead of leaving them blank, we explicitly mark them as “UNKNOWN”. This improves downstream clarity and avoids confusion between “missing” and “not applicable.”
4.2 Numeric imputations
Global mean: stable when per-ID history is sparse. Participant mean: more tailored when individuals have multiple visits.
4.2.1 Global Mean Imputation for Numeric Data
-- Compute overall average income (ignoring missing values)
SELECT AVG(income) AS global_mean_income
FROM public.visit_hospital
WHERE income IS NOT NULL;
| global_mean_income |
|---|
| 96123.1 |
-- Replace missing income values with global mean
SELECT
id_text,
visit,
income,
COALESCE(income, (SELECT AVG(income) FROM public.visit_hospital WHERE income IS NOT NULL)) AS income_imputed_global
FROM public.visit_hospital
ORDER BY id_text, visit
LIMIT 12;
| id_text | visit | income | income_imputed_global |
|---|---|---|---|
| Anit|Sánc|1960-05-07 | 1 | 61016 | 61016.0 |
| Anit|Sánc|1960-05-07 | 2 | NA | 96123.1 |
| Anit|Sánc|1960-05-07 | 3 | NA | 96123.1 |
| Anit|Sánc|1960-05-07 | 4 | 60016 | 60016.0 |
| Aure|Weis|2003-11-12 | 1 | 8752 | 8752.0 |
| Aure|Weis|2003-11-12 | 2 | NA | 96123.1 |
| Aure|Weis|2003-11-12 | 3 | NA | 96123.1 |
| Aure|Weis|2003-11-12 | 4 | 7752 | 7752.0 |
| Cars|Litt|2012-09-02 | 1 | 71169 | 71169.0 |
| Cars|Litt|2012-09-02 | 2 | NA | 96123.1 |
This replaces missing income values with the overall mean. While not ideal for predictive models (since it reduces variance), it provides a consistent baseline for descriptive summaries and visualization readiness.
4.2.2 Group-Level (Participant) Mean Imputation
A more refined approach is to use each patient’s own average income (if available). This reduces bias when patients vary substantially in income levels.
To better respect within-person variation, we calculate mean income for each participant (id_text) and use that as a fill value.
-- Compute mean income per patient
WITH participant_mean AS (
SELECT id_text, AVG(income) AS mean_income
FROM public.visit_hospital
WHERE income IS NOT NULL
GROUP BY id_text
)
SELECT
v.id_text,
v.visit,
v.income,
COALESCE(v.income, p.mean_income) AS income_imputed_participant
FROM public.visit_hospital v
LEFT JOIN participant_mean p
ON v.id_text = p.id_text
ORDER BY v.id_text, v.visit
LIMIT 12;
| id_text | visit | income | income_imputed_participant |
|---|---|---|---|
| Anit|Sánc|1960-05-07 | 1 | 61016 | 61016 |
| Anit|Sánc|1960-05-07 | 2 | NA | 60516 |
| Anit|Sánc|1960-05-07 | 3 | NA | 60516 |
| Anit|Sánc|1960-05-07 | 4 | 60016 | 60016 |
| Aure|Weis|2003-11-12 | 1 | 8752 | 8752 |
| Aure|Weis|2003-11-12 | 2 | NA | 8252 |
| Aure|Weis|2003-11-12 | 3 | NA | 8252 |
| Aure|Weis|2003-11-12 | 4 | 7752 | 7752 |
| Cars|Litt|2012-09-02 | 1 | 71169 | 71169 |
| Cars|Litt|2012-09-02 | 2 | NA | 70669 |
Context: This technique aligns with R’s group_by(id_text) %>% summarise(mean(income, na.rm = TRUE)). It produces personalized imputations and avoids using global means that may misrepresent individual economic patterns.
4.3 Forward and Backward Fill (Sequential Imputation)
When data is longitudinal (multiple visits per patient), we may want to carry values forward or backward. SQL has no native fill() like R’s tidyr, but window functions over visit-ordered rows achieve the same effect.
Forward and backward fills propagate the most recent known value across sequential observations — useful when attributes like race or gender should remain constant over multiple visits.
Forward filling uses the last known race value, while backward filling retrieves the next known one. These techniques are valuable when longitudinal patient data should remain internally consistent.
Why both? In visit data, the first or last known race may be the only non-missing value for a participant. Having forward and backward fills lets us choose the nearest known value relative to each visit.
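The running-group trick used in the fills below (a cumulative count of non-NULL values defines "islands" of rows that share the same fill value) can be sketched in plain Python:

```python
# Sketch of the SQL running-group forward fill in plain Python.
# grp mimics SUM(CASE WHEN v IS NOT NULL THEN 1 ELSE 0 END) OVER (ORDER BY ...):
# each non-None value opens a new island, and every row in that island
# gets the island's opening value (MAX(race) OVER (PARTITION BY grp) in SQL).
def forward_fill(values):
    grp, count = [], 0
    for v in values:
        if v is not None:
            count += 1
        grp.append(count)
    openers = {}  # island id -> the non-None value that opened it
    for g, v in zip(grp, values):
        if v is not None:
            openers[g] = v
    return [openers.get(g) for g in grp]

print(forward_fill(['white', None, 'white', None]))
# ['white', 'white', 'white', 'white']
print(forward_fill([None, 'asian', None]))
# [None, 'asian', 'asian']  (stays None until the first known value)
```

Reversing the sort order before applying the same logic yields the backward fill, exactly as the ORDER BY visit DESC variant does in SQL.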
Forward Fill (by id_text, ordered by visit)
-- Forward fill race within each id_text
WITH base AS (
SELECT
id_text,
visit,
race,
-- running group: increases when we hit a non-NULL race
SUM(CASE WHEN race IS NOT NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY id_text ORDER BY visit
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
FROM public.visit_hospital
)
SELECT
id_text,
visit,
race,
-- carry last known non-NULL forward; stays NULL until first non-NULL appears
MAX(race) OVER (PARTITION BY id_text, grp) AS race_forward
FROM base
ORDER BY id_text, visit
LIMIT 12;
| id_text | visit | race | race_forward |
|---|---|---|---|
| Anit|Sánc|1960-05-07 | 1 | white | white |
| Anit|Sánc|1960-05-07 | 2 | NA | white |
| Anit|Sánc|1960-05-07 | 3 | white | white |
| Anit|Sánc|1960-05-07 | 4 | NA | white |
| Aure|Weis|2003-11-12 | 1 | white | white |
| Aure|Weis|2003-11-12 | 2 | NA | white |
| Aure|Weis|2003-11-12 | 3 | white | white |
| Aure|Weis|2003-11-12 | 4 | NA | white |
| Cars|Litt|2012-09-02 | 1 | white | white |
| Cars|Litt|2012-09-02 | 2 | NA | white |
Backward Fill
-- Backward fill missing race values by patient
WITH base AS (
SELECT
id_text,
visit,
race,
-- running group from the right: increases when we hit a non-NULL race scanning backwards
SUM(CASE WHEN race IS NOT NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY id_text ORDER BY visit DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp_rev
FROM public.visit_hospital
)
SELECT
id_text,
visit,
race,
MAX(race) OVER (PARTITION BY id_text, grp_rev) AS race_backward
FROM base
ORDER BY id_text, visit
LIMIT 12;
| id_text | visit | race | race_backward |
|---|---|---|---|
| Anit|Sánc|1960-05-07 | 1 | white | white |
| Anit|Sánc|1960-05-07 | 2 | NA | white |
| Anit|Sánc|1960-05-07 | 3 | white | white |
| Anit|Sánc|1960-05-07 | 4 | NA | NA |
| Aure|Weis|2003-11-12 | 1 | white | white |
| Aure|Weis|2003-11-12 | 2 | NA | white |
| Aure|Weis|2003-11-12 | 3 | white | white |
| Aure|Weis|2003-11-12 | 4 | NA | NA |
| Cars|Litt|2012-09-02 | 1 | white | white |
| Cars|Litt|2012-09-02 | 2 | NA | white |
4.4 Combined imputation review (side-by-side)
The example below shows a practical “imputation report”: original vs. forward/backward race and both income fills, restricted to rows that actually had gaps. This gives reviewers a one-glance sanity check.
WITH base AS (
SELECT
id_text, visit, race, income
FROM public.visit_hospital
),
-- forward-fill race within id_text
ff_prep AS (
SELECT *,
SUM((race IS NOT NULL)::int)
OVER (PARTITION BY id_text ORDER BY visit) AS grp
FROM base
),
fwd AS (
SELECT id_text, visit,
MAX(race) OVER (PARTITION BY id_text, grp) AS race_forward
FROM ff_prep
),
-- backward-fill race within id_text (for leading NULLs)
bf_prep AS (
SELECT *,
SUM((race IS NOT NULL)::int)
OVER (PARTITION BY id_text ORDER BY visit DESC) AS grp_rev
FROM base
),
bwd AS (
SELECT id_text, visit,
MAX(race) OVER (PARTITION BY id_text, grp_rev) AS race_backward
FROM bf_prep
),
-- means for income imputation
gmean AS (
SELECT AVG(income) AS global_mean_income
FROM public.visit_hospital
WHERE income IS NOT NULL
),
pmean AS (
SELECT id_text, AVG(income) AS mean_income_id
FROM public.visit_hospital
WHERE income IS NOT NULL
GROUP BY id_text
)
SELECT
v.id_text, v.visit,
v.race,
f.race_forward,
b.race_backward,
v.income,
COALESCE(v.income, g.global_mean_income) AS income_global_mean,
COALESCE(v.income, p.mean_income_id) AS income_participant_mean
FROM public.visit_hospital v
LEFT JOIN fwd f USING (id_text, visit)
LEFT JOIN bwd b USING (id_text, visit)
CROSS JOIN gmean g
LEFT JOIN pmean p USING (id_text)
WHERE v.race IS NULL OR v.income IS NULL
ORDER BY v.id_text, v.visit NULLS FIRST
LIMIT 15;
| id_text | visit | race | race_forward | race_backward | income | income_global_mean | income_participant_mean |
|---|---|---|---|---|---|---|---|
| Anit|Sánc|1960-05-07 | 2 | NA | white | white | NA | 96123.1 | 60516 |
| Anit|Sánc|1960-05-07 | 3 | white | white | white | NA | 96123.1 | 60516 |
| Anit|Sánc|1960-05-07 | 4 | NA | white | NA | 60016 | 60016.0 | 60016 |
| Aure|Weis|2003-11-12 | 2 | NA | white | white | NA | 96123.1 | 8252 |
| Aure|Weis|2003-11-12 | 3 | white | white | white | NA | 96123.1 | 8252 |
| Aure|Weis|2003-11-12 | 4 | NA | white | NA | 7752 | 7752.0 | 7752 |
| Cars|Litt|2012-09-02 | 2 | NA | white | white | NA | 96123.1 | 70669 |
| Cars|Litt|2012-09-02 | 3 | NA | white | white | 70169 | 70169.0 | 70169 |
| Cars|Litt|2012-09-02 | 4 | white | white | white | NA | 96123.1 | 70669 |
| Doll|Fran|2008-10-04 | 2 | NA | white | white | NA | 96123.1 | 187523 |
🧮 Summary Table of Missing Data Techniques
| Task / Objective | SQL Feature or Function | Purpose / What It Does | When to Use It |
|---|---|---|---|
| Profile Missing Data | `COUNT(*) FILTER (WHERE col IS NULL)` | Counts how many missing (NULL) entries exist in each column | Always the first step in a data-quality audit |
| Check Total Missing Cells | `(col IS NULL)::int` summed across columns | Produces a total missing count across all variables | Useful for an overall data-completeness score |
| Inspect Incomplete Records | `WHERE col IS NULL OR ...` | Shows which specific rows have missing values | Use to detect patterns in missingness (e.g., per patient or visit) |
| Replace Missing Text Values | `COALESCE(column, 'UNKNOWN')` | Fills missing categorical data with an explicit label | Good for identifiers or attributes like passport, gender, or ethnicity |
| Replace Missing Numeric Values (Global Mean) | `COALESCE(income, (SELECT AVG(income) FROM ...))` | Fills missing numeric fields with a global average | Simple and consistent for reporting, though it may reduce data variance |
| Replace Missing Numeric Values (Group Mean) | `LEFT JOIN` on grouped `AVG()` | Calculates and fills missing values with participant-specific means | More realistic for longitudinal or panel data (e.g., income per patient) |
| Forward Fill (Downward) | Running non-NULL counter `SUM(...) OVER (PARTITION BY id ORDER BY time)` plus `MAX(col) OVER (PARTITION BY id, grp)` | Propagates the last known non-NULL value forward | Use when variables remain stable over time (e.g., race, gender); note PostgreSQL lacks `IGNORE NULLS`, so `LAST_VALUE` alone will not fill |
| Backward Fill (Upward) | The same counter computed with `ORDER BY time DESC` | Fills missing values upward (from the next known observation) | Complements forward fill for complete coverage |
| Type Casting for Numeric Fields | `CAST(col AS NUMERIC)` | Converts text-encoded numbers to a true numeric type | Needed before aggregation, averages, or mathematical operations |
| String Cleaning (Regex) | `REGEXP_REPLACE(text, '[^A-Za-z\\s]', '', 'g')` | Removes non-letter characters or formatting noise | Prepares identifiers or names for joins and standardization |
| Date Conversion | `TO_DATE(col, 'YYYY-MM-DD')` | Converts character dates to date objects | Enables date arithmetic, filtering, and time-based grouping |
| Date Component Extraction | `EXTRACT(YEAR FROM date_col)` | Retrieves year, month, or day components | Supports monthly/annual trend analysis |
| Filtering Complete Cases | `WHERE col1 IS NOT NULL AND col2 IS NOT NULL` | Keeps only fully complete rows for analysis | Appropriate for strict models or visualizations that can’t handle nulls |
Analytical Insights
- Profiling comes before cleaning. Always understand the pattern and reason behind missingness before applying fixes.
- Context drives imputation choice. Not every missing value should be replaced — sometimes absence itself carries information (e.g., missing death date = still alive).
- Group-based imputations preserve individual structure. Replacing income within each participant group respects the natural data hierarchy.
- Sequential fills replicate temporal logic. Forward and backward filling are essential for visit-based or longitudinal data where attributes should persist across time.
- Documentation ensures transparency. Each step in the SQL workflow can be reviewed, audited, or reproduced by other analysts.
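The second insight above can be made concrete: a missing deathdate may itself carry information rather than being a gap to impute. A minimal sketch against the same visit_hospital table (the vital_status label is illustrative, not an existing column):

```sql
-- Treat a NULL deathdate as information (patient presumed alive), not as a value to fill
SELECT
  id_text,
  deathdate,
  CASE
    WHEN deathdate IS NULL THEN 'presumed alive'
    ELSE 'deceased'
  END AS vital_status
FROM public.visit_hospital
LIMIT 10;
```

Encoding the meaning of absence explicitly like this keeps downstream reports honest about what “missing” represents.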
Data Casting & Cleaning (Regex, Numeric/Text Conversion)
Real-world datasets often contain inconsistent or improperly formatted data — text in numeric fields, extra spaces, mixed casing, or special characters in names. These issues arise during data collection, system exports, or file conversions, and they can silently break joins, aggregations, or numeric calculations.
This section focuses on casting, cleaning, and standardizing key variables in the visit_hospital dataset.
We’ll work through examples that show how SQL handles type conversion, regex-based cleanup, and normalization of text fields.
1️⃣ Understanding Data Type Inconsistencies
Before fixing, it’s important to detect type or formatting inconsistencies.
For example, numeric columns may contain text artifacts ('N/A', 'unknown', '#NULL!'), and text columns may contain hidden characters or mixed case.
-- Preview some raw data for inspection
SELECT id_text, income, fips, zip, passport
FROM public.visit_hospital
LIMIT 12;
| id_text | income | fips | zip | passport |
|---|---|---|---|---|
| Hatt|Nade|1986-07-04 | 86384 | 25017 | 2468 | X61479569X |
| Cars|Litt|2012-09-02 | 71169 | 44007 | 2861 | NA |
| Anit|Sánc|1960-05-07 | 61016 | 25009 | 1915 | X9157439X |
| Doll|Fran|2008-10-04 | 188023 | NA | NA | NA |
| Hatt|Nade|1986-07-04 | NA | 25017 | 2468 | X61479569X |
| Cars|Litt|2012-09-02 | NA | 44007 | 2861 | NA |
| Anit|Sánc|1960-05-07 | NA | 25009 | 1915 | X9157439X |
| Doll|Fran|2008-10-04 | NA | NA | NA | NA |
| Hatt|Nade|1986-07-04 | 85384 | 25017 | 2468 | X61479569X |
| Cars|Litt|2012-09-02 | 70169 | 44007 | 2861 | NA |
This step gives a sense of what we’re dealing with. For example, check whether the numeric income column contains null-like strings, or whether fips and zip contain unexpected symbols or missing digits. Knowing what’s wrong guides the cleaning strategy.
-- Add raw text copies of selected columns for the cleaning demonstrations
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS income_raw TEXT,
ADD COLUMN IF NOT EXISTS zip_raw TEXT,
ADD COLUMN IF NOT EXISTS fips_raw TEXT,
ADD COLUMN IF NOT EXISTS marital_raw TEXT,
ADD COLUMN IF NOT EXISTS city_raw TEXT,
ADD COLUMN IF NOT EXISTS passport_raw TEXT;

UPDATE public.visit_hospital
SET income_raw = income::text,
    zip_raw = zip::text,
    fips_raw = fips::text,
    marital_raw = marital,
    city_raw = city,
    passport_raw = passport;

-- Generate a row number to use with the modulo operator (%)
-- and inject messy demo values into the raw columns
WITH numbered AS (
  SELECT id, row_number() OVER (ORDER BY id) AS rn
  FROM public.visit_hospital
)
UPDATE public.visit_hospital AS v
SET income_raw = CASE WHEN n.rn % 10 = 0 THEN 'N/A'
                      WHEN n.rn % 15 = 0 THEN '#NULL!'
                      ELSE v.income_raw END,
    zip_raw = CASE WHEN n.rn % 12 = 0 THEN '123'
                   WHEN n.rn % 13 = 0 THEN '12-345'
                   WHEN n.rn % 14 = 0 THEN 'A12B3'
                   ELSE v.zip_raw END,
    fips_raw = CASE WHEN n.rn % 11 = 0 THEN 'abc12'
                    WHEN n.rn % 17 = 0 THEN '1234'
                    ELSE v.fips_raw END,
    marital_raw = CASE WHEN n.rn % 8 = 0 THEN 'Single'
                       WHEN n.rn % 9 = 0 THEN 'single'
                       WHEN n.rn % 10 = 0 THEN 'SINGLE'
                       ELSE v.marital_raw END,
    city_raw = CASE WHEN n.rn % 20 = 0 THEN ' new york '
                    WHEN n.rn % 21 = 0 THEN 'CH!CAGO'
                    WHEN n.rn % 22 = 0 THEN 'los_angeles'
                    ELSE v.city_raw END,
    passport_raw = CASE WHEN n.rn % 25 = 0 THEN ''
                        WHEN n.rn % 26 = 0 THEN NULL
                        ELSE v.passport_raw END
FROM numbered AS n
WHERE v.id = n.id;
2️⃣ Casting Numeric-Like Columns
If a column expected to be numeric contains non-numeric text, direct aggregations like SUM() or AVG() will fail. We can safely convert these columns by filtering or replacing invalid entries.
2.1 Checking Invalid Numeric Values
-- Identify non-numeric entries in columns expected to be numeric
SELECT id_text, income
FROM public.visit_hospital
WHERE income::text !~ '^[0-9]+(\.[0-9]+)?$'
AND income IS NOT NULL
LIMIT 10;
| id_text | income |
|---|---|
The regex '^[0-9]+(\.[0-9]+)?$' matches only valid integers or decimals; anything else (e.g., 'N/A', '#NULL!', 'unknown') is non-numeric. The !~ operator in PostgreSQL returns rows whose values do not match the regex. Here the check returns no rows because income is already stored as a numeric column; applied to a text column such as income_raw, it is where invalid entries would surface.
2.2 Convert to Numeric Safely
-- Cast to numeric only when the text form matches a numeric pattern
SELECT
id_text,
income,
CASE
WHEN income::text ~ '^\s*-?\d+(\.\d+)?\s*$' THEN income::numeric
ELSE NULL
END AS income_cleaned
FROM public.visit_hospital
LIMIT 12;
| id_text | income | income_cleaned |
|---|---|---|
| Hatt|Nade|1986-07-04 | 86384 | 86384 |
| Cars|Litt|2012-09-02 | 71169 | 71169 |
| Anit|Sánc|1960-05-07 | 61016 | 61016 |
| Doll|Fran|2008-10-04 | 188023 | 188023 |
| Hatt|Nade|1986-07-04 | NA | NA |
| Cars|Litt|2012-09-02 | NA | NA |
| Anit|Sánc|1960-05-07 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | 85384 | 85384 |
| Cars|Litt|2012-09-02 | 70169 | 70169 |
This avoids errors by casting only valid numeric entries. Invalid strings are replaced with NULL, allowing numeric aggregations to run safely.
3️⃣ Text Cleaning with Regex and Standardization
3.1 Clean City and Name Fields
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS city_raw TEXT;

UPDATE public.visit_hospital
SET city_raw = city;

-- Remove special characters and extra spaces from text
SELECT
id_text,
city_raw,
REGEXP_REPLACE(TRIM(city_raw), '[^A-Za-z\\s]', '', 'g') AS city_cleaned
FROM public.visit_hospital
LIMIT 12;
| id_text | city_raw | city_cleaned |
|---|---|---|
| Hatt|Nade|1986-07-04 | Newton | Newton |
| Cars|Litt|2012-09-02 | Attleboro | Attleboro |
| Anit|Sánc|1960-05-07 | Beverly | Beverly |
| Doll|Fran|2008-10-04 | Shrewsbury | Shrewsbury |
| Hatt|Nade|1986-07-04 | Newton | Newton |
| Cars|Litt|2012-09-02 | Attleboro | Attleboro |
| Anit|Sánc|1960-05-07 | Beverly | Beverly |
| Doll|Fran|2008-10-04 | Shrewsbury | Shrewsbury |
| Hatt|Nade|1986-07-04 | Newton | Newton |
| Cars|Litt|2012-09-02 | Attleboro | Attleboro |
TRIM() removes leading and trailing spaces.
REGEXP_REPLACE() strips non-alphabetic characters (e.g., “CH!CAGO” → “CHCAGO”).
These techniques clean text for better grouping and joining.
3.2 Standardize Text Casing
-- Normalize categorical text casing
SELECT
id_text,
marital,
LOWER(marital) AS marital_lower,
INITCAP(LOWER(marital)) AS marital_title
FROM public.visit_hospital
LIMIT 12;
| id_text | marital | marital_lower | marital_title |
|---|---|---|---|
| Hatt|Nade|1986-07-04 | M | m | M |
| Cars|Litt|2012-09-02 | NA | NA | NA |
| Anit|Sánc|1960-05-07 | W | w | W |
| Doll|Fran|2008-10-04 | NA | NA | NA |
| Hatt|Nade|1986-07-04 | M | m | M |
| Cars|Litt|2012-09-02 | NA | NA | NA |
| Anit|Sánc|1960-05-07 | W | w | W |
| Doll|Fran|2008-10-04 | NA | NA | NA |
| Hatt|Nade|1986-07-04 | M | m | M |
| Cars|Litt|2012-09-02 | NA | NA | NA |
Mixed casing can create false duplicates (“Single”, “single”, “SINGLE”). LOWER() ensures uniformity for analysis, while INITCAP() is ideal for presentation-ready formats.
4️⃣ Cleaning and Padding Identifiers
4.1 ZIP Code Cleaning
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS zip_raw TEXT;

UPDATE public.visit_hospital
SET zip_raw = zip::text
WHERE zip_raw IS NULL;

-- Remove non-numeric characters and pad to 5 digits
SELECT
id_text,
zip_raw,
LPAD(REGEXP_REPLACE(zip_raw, '[^0-9]', '', 'g'), 5, '0') AS zip_cleaned
FROM public.visit_hospital
LIMIT 12;
| id_text | zip_raw | zip_cleaned |
|---|---|---|
| Hatt|Nade|1986-07-04 | 2468 | 02468 |
| Cars|Litt|2012-09-02 | 2861 | 02861 |
| Anit|Sánc|1960-05-07 | 1915 | 01915 |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | 2468 | 02468 |
| Cars|Litt|2012-09-02 | 2861 | 02861 |
| Cars|Litt|2012-09-02 | 2861 | 02861 |
| Anit|Sánc|1960-05-07 | 1915 | 01915 |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | 2468 | 02468 |
REGEXP_REPLACE() removes dashes, letters, and spaces; LPAD() ensures all ZIP codes have five digits (‘123’ → ‘00123’).
4.2 FIPS Code Cleaning
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS fips_raw TEXT;

UPDATE public.visit_hospital
SET fips_raw = fips::text
WHERE fips_raw IS NULL;

-- Clean and normalize FIPS codes
SELECT
id_text,
fips_raw,
LPAD(REGEXP_REPLACE(fips_raw, '[^0-9]', '', 'g'), 5, '0') AS fips_cleaned
FROM public.visit_hospital
LIMIT 12;
| id_text | fips_raw | fips_cleaned |
|---|---|---|
| Hatt|Nade|1986-07-04 | 25017 | 25017 |
| Cars|Litt|2012-09-02 | 44007 | 44007 |
| Anit|Sánc|1960-05-07 | 25009 | 25009 |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | 25017 | 25017 |
| Cars|Litt|2012-09-02 | 44007 | 44007 |
| Cars|Litt|2012-09-02 | 44007 | 44007 |
| Anit|Sánc|1960-05-07 | 25009 | 25009 |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | 25017 | 25017 |
Standardized FIPS codes ensure geographic consistency. Removing non-numeric characters prevents join failures when linking with regional or census datasets.
5️⃣ Categorical & Identifier Normalization
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS marital_raw TEXT;

UPDATE public.visit_hospital
SET marital_raw = marital
WHERE marital_raw IS NULL;

-- Identify distinct marital variants
SELECT DISTINCT marital_raw
FROM public.visit_hospital
ORDER BY marital_raw;
| marital_raw |
|---|
| D |
| M |
| S |
| W |
| NA |
In manually entered survey or registry data, this kind of check quickly reveals casing variants such as “single”, “Single”, and “SINGLE” that would otherwise split one category into several.
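To see how many raw spellings would collapse into each normalized category, the same check can group on a lowercased, trimmed value (a sketch, assuming marital_raw is populated as above):

```sql
-- Count raw spelling variants per normalized marital category
SELECT
  LOWER(TRIM(marital_raw))    AS marital_norm,
  COUNT(DISTINCT marital_raw) AS raw_variants,
  COUNT(*)                    AS n_rows
FROM public.visit_hospital
GROUP BY LOWER(TRIM(marital_raw))
ORDER BY marital_norm;
```

A raw_variants count above 1 flags categories that need case normalization before aggregation.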
6️⃣ Before vs After: Data Cleaning Summary View
The next query displays raw vs. cleaned columns side-by-side, allowing a clear visual comparison for documentation or dashboards.
-- Add raw text versions of numeric/text fields for cleaning
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS income_raw TEXT,
ADD COLUMN IF NOT EXISTS zip_raw TEXT,
ADD COLUMN IF NOT EXISTS fips_raw TEXT,
ADD COLUMN IF NOT EXISTS marital_raw TEXT,
ADD COLUMN IF NOT EXISTS city_raw TEXT,
ADD COLUMN IF NOT EXISTS passport_raw TEXT;
-- Populate them from existing columns
UPDATE public.visit_hospital
SET
income_raw = income::text,
zip_raw = zip::text,
fips_raw = fips::text,
marital_raw = marital,
city_raw = city,
passport_raw = passport
WHERE income_raw IS NULL
OR zip_raw IS NULL
OR fips_raw IS NULL
OR marital_raw IS NULL
OR city_raw IS NULL
OR passport_raw IS NULL;

-- Create a clean comparison view of selected variables
SELECT
id_text,
income_raw,
CASE
WHEN income_raw ~ '^\s*-?\d+(\.\d+)?\s*$'
THEN CAST(TRIM(income_raw) AS NUMERIC)
ELSE NULL
END AS income_cleaned,
zip_raw,
LPAD(REGEXP_REPLACE(zip_raw, '[^0-9]', '', 'g'), 5, '0') AS zip_cleaned,
fips_raw,
LPAD(REGEXP_REPLACE(fips_raw, '[^0-9]', '', 'g'), 5, '0') AS fips_cleaned,
marital_raw,
INITCAP(LOWER(marital_raw)) AS marital_cleaned,
city_raw,
REGEXP_REPLACE(TRIM(city_raw), '[^A-Za-z\\s]', '', 'g') AS city_cleaned,
passport_raw,
COALESCE(NULLIF(TRIM(passport_raw), ''), 'UNKNOWN') AS passport_cleaned
FROM public.visit_hospital
ORDER BY id_text
LIMIT 20;
| id_text | income_raw | income_cleaned | zip_raw | zip_cleaned | fips_raw | fips_cleaned | marital_raw | marital_cleaned | city_raw | city_cleaned | passport_raw | passport_cleaned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anit|Sánc|1960-05-07 | NA | NA | 1915 | 01915 | 25009 | 25009 | W | W | Beverly | Beverly | X9157439X | X9157439X |
| Anit|Sánc|1960-05-07 | 61016 | 61016 | 1915 | 01915 | 25009 | 25009 | W | W | Beverly | Beverly | X9157439X | X9157439X |
| Anit|Sánc|1960-05-07 | 60016 | 60016 | 1915 | 01915 | 25009 | 25009 | W | W | Beverly | Beverly | X9157439X | X9157439X |
| Anit|Sánc|1960-05-07 | NA | NA | 1915 | 01915 | 25009 | 25009 | W | W | Beverly | Beverly | X9157439X | X9157439X |
| Aure|Weis|2003-11-12 | NA | NA | 1606 | 01606 | 25027 | 25027 | NA | NA | Worcester | Worcester | X17586296X | X17586296X |
| Aure|Weis|2003-11-12 | 7752 | 7752 | 1606 | 01606 | 25027 | 25027 | NA | NA | Worcester | Worcester | X17586296X | X17586296X |
| Aure|Weis|2003-11-12 | 8752 | 8752 | 1606 | 01606 | 25027 | 25027 | NA | NA | Worcester | Worcester | X17586296X | X17586296X |
| Aure|Weis|2003-11-12 | NA | NA | 1606 | 01606 | 25027 | 25027 | NA | NA | Worcester | Worcester | X17586296X | X17586296X |
| Cars|Litt|2012-09-02 | 71169 | 71169 | 2861 | 02861 | 44007 | 44007 | NA | NA | Attleboro | Attleboro | NA | UNKNOWN |
| Cars|Litt|2012-09-02 | NA | NA | 2861 | 02861 | 44007 | 44007 | NA | NA | Attleboro | Attleboro | NA | UNKNOWN |
This “before and after” display is ideal for portfolio demonstrations — clearly showing how SQL transformations standardize messy data.
🧮 Summary of Techniques and Insights
| Task / Goal | SQL Technique | Purpose / What It Fixes | Analytical Context / When to Use |
|---|---|---|---|
| Detect invalid numeric entries | `!~ '^\s*-?\d+(\.\d+)?\s*$'` | Finds non-numeric strings in numeric columns | Ensures numeric operations (e.g., AVG, SUM) don’t fail or misreport |
| Safe numeric casting | `CASE WHEN col ~ regex THEN CAST(col AS NUMERIC)` | Converts only valid entries to numeric | Prevents casting errors during data type conversions |
| Trim and clean text fields | `TRIM()` | Removes extra spaces from names or city fields | Useful for manual-entry data or multi-system merges |
| Remove unwanted characters | `REGEXP_REPLACE(col, '[^A-Za-z\\s]', '', 'g')` | Cleans text of digits, punctuation, and symbols | Standardizes strings for grouping, joining, or machine learning models |
| Normalize text casing | `LOWER()`, `UPPER()`, `INITCAP()` | Converts inconsistent casing to standard formats | Ensures accurate grouping and consistent visual presentation |
| Clean ZIP and FIPS codes | `REGEXP_REPLACE()` + `LPAD()` | Removes non-numeric characters and pads missing zeros | Prepares identifiers for geographic joins and validation |
| Handle blank or missing categorical fields | `COALESCE(NULLIF(TRIM(col), ''), 'UNKNOWN')` | Converts blank text and NULLs to clear labels | Makes categorical variables explicit for dashboards and reporting |
| Detect inconsistent category variants | `SELECT DISTINCT LOWER(col)` | Finds duplicated categories differing only by case | Helps consolidate values before aggregation |
| Preview before vs. after cleaning | Raw vs. cleaned comparison query | Shows immediate data improvements | Ideal for portfolio documentation and QA verification |
Key Takeaways
- Data type consistency ensures that every column supports its intended operation (numeric → arithmetic, text → joins).
- Regex-based cleaning enables flexible and repeatable correction of messy or user-entered data.
- Case and category normalization eliminate hidden duplicates, improving grouping accuracy.
- Padding identifiers keeps cross-referenced codes (like ZIP or FIPS) aligned with external sources.
- Blank-field handling with COALESCE() ensures clarity between “missing” and “not applicable.”
- SQL-first cleaning supports reproducibility and scalability — ideal for enterprise pipelines and analytical environments.
Date & Time Basics (conversion, extraction, validation)
Dates and times are among the most critical yet error-prone data types in analytics.
Whether you’re working with patient visits, transactions, or time-series indicators, proper handling of temporal data ensures analytical accuracy and chronological integrity.
This section demonstrates key SQL techniques for date conversion, validation, and temporal analysis using the visit_hospital dataset.
Inspecting and Profiling Date Columns
The dataset includes birthdate, deathdate, and a visit number field (visit).
Let’s quickly inspect their contents and verify their types.
-- Review the date columns to understand their current formats
SELECT
id_text,
birthdate,
deathdate,
visit
FROM public.visit_hospital
LIMIT 10;
| id_text | birthdate | deathdate | visit |
|---|---|---|---|
| Hatt|Nade|1986-07-04 | NA | NA | 1 |
| Cars|Litt|2012-09-02 | NA | NA | 1 |
| Anit|Sánc|1960-05-07 | NA | NA | 1 |
| Doll|Fran|2008-10-04 | NA | NA | 1 |
| Cars|Litt|2012-09-02 | NA | NA | 2 |
| Cars|Litt|2012-09-02 | NA | NA | 3 |
| Anit|Sánc|1960-05-07 | NA | NA | 3 |
| Doll|Fran|2008-10-04 | NA | NA | 3 |
| Hatt|Nade|1986-07-04 | NA | NA | 4 |
| Cars|Litt|2012-09-02 | NA | NA | 4 |
This initial inspection gives a quick sense of what we’re working with before any conversion or validation.
Creating Raw Text Versions of Date Fields
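The profiling queries that follow reference a birthdate_raw text column that is not created in the code shown so far. A minimal sketch of how such raw copies could be added (the mixed-format values seen in later outputs would be seeded by a separate step, analogous to the messy-value injection used for income_raw):

```sql
-- Keep raw text copies of the date columns so the originals stay untouched
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS birthdate_raw TEXT,
ADD COLUMN IF NOT EXISTS deathdate_raw TEXT;

UPDATE public.visit_hospital
SET birthdate_raw = birthdate::text,
    deathdate_raw = deathdate::text
WHERE birthdate_raw IS NULL
  AND deathdate_raw IS NULL;
```

Working on text copies lets us profile and repair freely while preserving the source columns for comparison.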
Detect Invalid or Impossible Dates
A common quality check involves identifying logically impossible dates, such as birthdates in the future or deathdates earlier than birthdates.
-- Classify raw dates by validity and pattern
WITH parts AS (
SELECT
id_text,
birthdate_raw,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 1)::int
END AS year,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 2)::int
END AS month,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 3)::int
END AS day
FROM public.visit_hospital
),
classified AS (
SELECT
id_text,
birthdate_raw,
CASE
WHEN year IS NOT NULL THEN
CASE
WHEN month BETWEEN 1 AND 12
AND day BETWEEN 1 AND 31
THEN 'Valid YYYY-MM-DD'
ELSE 'Invalid YYYY-MM-DD'
END
ELSE 'Other/Unknown Format'
END AS pattern
FROM parts
)
SELECT pattern, COUNT(*) AS record_count
FROM classified
GROUP BY pattern
ORDER BY record_count DESC;
| pattern | record_count |
|---|---|
| Other/Unknown Format | 32 |
| Valid YYYY-MM-DD | 4 |
| Invalid YYYY-MM-DD | 4 |
This profiling identifies all structural variations. Typical outputs include YYYY-MM-DD, YYYY/MM/DD, DD-MM-YYYY, ISO with time, and residual invalid or sentinel categories. Such analysis helps define your cleaning strategy precisely.
Profiling Valid vs. Invalid Date Patterns
We first split potential YYYY-MM-DD entries into numeric components and classify them as valid, invalid, or unknown/other based on logical rules.
split_part() extracts year, month, and day components.
The CASE logic validates ranges without immediately casting — avoiding date/time field out of range errors.
This provides a count of valid vs. invalid entries before proceeding.
Previewing Invalid and Unrecognized Formats
To inspect the actual records in those categories, filter directly on the pattern column.
-- Inspect rows with invalid or unknown date formats
WITH parts AS (
SELECT
id_text,
birthdate_raw,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 1)::int
END AS year,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 2)::int
END AS month,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 3)::int
END AS day
FROM public.visit_hospital
),
classified AS (
SELECT
id_text,
birthdate_raw,
CASE
WHEN year IS NOT NULL THEN
CASE
WHEN month BETWEEN 1 AND 12
AND day BETWEEN 1 AND 31
THEN 'Valid YYYY-MM-DD'
ELSE 'Invalid YYYY-MM-DD'
END
ELSE 'Other/Unknown Format'
END AS pattern
FROM parts
)
SELECT id_text, birthdate_raw, pattern
FROM classified
WHERE pattern IN ('Invalid YYYY-MM-DD', 'Other/Unknown Format')
ORDER BY pattern, id_text
LIMIT 20;
| id_text | birthdate_raw | pattern |
|---|---|---|
| Home|Quiñ|1955-10-26 | 1984-13-24 | Invalid YYYY-MM-DD |
| Home|Quiñ|1955-10-26 | 1984-13-24 | Invalid YYYY-MM-DD |
| Home|Quiñ|1955-10-26 | 1984-13-24 | Invalid YYYY-MM-DD |
| Home|Quiñ|1955-10-26 | 1984-13-24 | Invalid YYYY-MM-DD |
| Anit|Sánc|1960-05-07 | NA | Other/Unknown Format |
| Anit|Sánc|1960-05-07 | NA | Other/Unknown Format |
| Anit|Sánc|1960-05-07 | NA | Other/Unknown Format |
| Anit|Sánc|1960-05-07 | NA | Other/Unknown Format |
| Cars|Litt|2012-09-02 | 1986/07/04 | Other/Unknown Format |
| Cars|Litt|2012-09-02 | 1986/07/04 | Other/Unknown Format |
This step lists actual problematic records — the type of visual audit analysts perform before applying corrections. It allows you to verify whether errors are structural (e.g., 1984-13-24) or represent other issues such as placeholders (NA, blanks, or unrecognized symbols).
Repairing and Formatting Invalid Dates
Invalid entries can be normalized through conditional parsing, string manipulation, or replacement rules. The example below demonstrates three repair techniques:
Convert slashed or dashed formats consistently.
Replace sentinel values (NA, #NULL!) with NULL.
Remove invalid or non-parsable text safely.
-- Clean and standardize textual date representations
SELECT
id_text,
birthdate_raw,
CASE
WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN NULL
WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$' THEN REPLACE(birthdate_raw, '/', '-') -- unify separator
WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$' THEN
TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD') -- convert to ISO
WHEN birthdate_raw ~ '^\s*\d{4}-\d{2}-\d{2}\s*$' THEN TRIM(birthdate_raw) -- already clean
ELSE NULL
END AS birthdate_formatted
FROM public.visit_hospital
LIMIT 20;
| id_text | birthdate_raw | birthdate_formatted |
|---|---|---|
| Hatt|Nade|1986-07-04 | NA | NA |
| Cars|Litt|2012-09-02 | 1986/07/04 | 1986-07-04 |
| Anit|Sánc|1960-05-07 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Cars|Litt|2012-09-02 | 1986/07/04 | 1986-07-04 |
| Cars|Litt|2012-09-02 | 1986/07/04 | 1986-07-04 |
| Anit|Sánc|1960-05-07 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Dwig|Hill|1984-08-24 | NA | NA |
| Tamb|Bali|1953-05-10 | NA | NA |
This step converts all compatible formats into the standard YYYY-MM-DD.
Invalid, unrecognized, or sentinel entries are replaced with NULL.
Using TO_CHAR() ensures consistent ISO formatting for presentation or export.
Next, we extend the date-cleaning logic to ensure that all conversions are type-safe, format-consistent, and free of range errors.
By validating both the structure and value ranges of date components before conversion, we can confidently transform diverse text-based inputs into clean, usable DATE fields.
-- Safely convert and validate mixed-format date strings
WITH cleaned AS (
SELECT
id_text,
CASE
-- Handle missing or placeholder values
WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN NULL
-- Reformat YYYY/MM/DD → YYYY-MM-DD
WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$'
THEN REPLACE(birthdate_raw, '/', '-')
-- Convert DD-MM-YYYY → YYYY-MM-DD
WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$'
THEN TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD')
-- Keep YYYY-MM-DD only if month/day are in valid ranges
WHEN birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$'
THEN TRIM(birthdate_raw)
ELSE NULL
END AS birthdate_clean_text
FROM public.visit_hospital
)
SELECT
id_text,
birthdate_clean_text,
CASE
-- Only cast strings that pass full structural and range validation
WHEN birthdate_clean_text ~ '^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
THEN TO_DATE(birthdate_clean_text, 'YYYY-MM-DD')
ELSE NULL
END AS birthdate_final
FROM cleaned
WHERE birthdate_clean_text IS NOT NULL
LIMIT 15;
| id_text | birthdate_clean_text | birthdate_final |
|---|---|---|
| Cars|Litt|2012-09-02 | 1986-07-04 | NA |
| Cars|Litt|2012-09-02 | 1986-07-04 | NA |
| Cars|Litt|2012-09-02 | 1986-07-04 | NA |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
- The inner CASE logic prepares a standardized YYYY-MM-DD text representation.
- Regex guards ensure only syntactically and semantically valid month/day values are converted.
- Invalid, blank, or sentinel entries are safely excluded.
- The outer CASE performs the final cast to the true PostgreSQL DATE type.

This two-step validation (first reformat, then convert) prevents most 22008: date/time field value out of range errors, even on messy or mixed-format datasets. Note that the regex range guard alone still admits calendar-impossible days such as February 30; the month-length check shown later in this section catches those.
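As a complementary check, PostgreSQL 16 and newer ship the built-in pg_input_is_valid() function, which tests whether a string would cast cleanly to a given type without raising an error. A hedged sketch, assuming the same visit_hospital table and PostgreSQL 16+:

```sql
-- Sketch (PostgreSQL 16+): list raw strings that would NOT cast cleanly to DATE.
-- pg_input_is_valid() returns false instead of raising an error on bad input.
SELECT id_text, birthdate_raw
FROM public.visit_hospital
WHERE birthdate_raw NOT IN ('NA', '#NULL!', '')
  AND NOT pg_input_is_valid(TRIM(birthdate_raw), 'date');
```

On older PostgreSQL versions, the regex-guarded CASE approach above remains the portable option.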
Preview of Successfully Cleaned Dates
-- Preview cleaned and validated dates
WITH cleaned AS (
SELECT
id_text,
CASE
WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN NULL
WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$' THEN REPLACE(birthdate_raw, '/', '-')
WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$' THEN TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD')
WHEN birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$' THEN TRIM(birthdate_raw)
ELSE NULL
END AS birthdate_clean_text
FROM public.visit_hospital
)
SELECT
id_text,
birthdate_clean_text,
TO_DATE(birthdate_clean_text, 'YYYY-MM-DD') AS birthdate_final
FROM cleaned
WHERE birthdate_clean_text IS NOT NULL
ORDER BY id_text
LIMIT 20;

| id_text | birthdate_clean_text | birthdate_final |
|---|---|---|
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Cars|Litt|2012-09-02 | 1986-07-04 | 1986-07-04 |
| Cars|Litt|2012-09-02 | 1986-07-04 | 1986-07-04 |
| Cars|Litt|2012-09-02 | 1986-07-04 | 1986-07-04 |
| Cars|Litt|2012-09-02 | 1986-07-04 | 1986-07-04 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
This preview confirms that only valid and correctly formatted entries remain. All date strings are now standardized in ISO format (YYYY-MM-DD) and safely castable into the DATE type for further analysis.
Detecting Residual Missing or Invalid Entries
It’s useful to summarize how many rows were successfully cleaned versus how many remain missing or invalid. This summary provides a data-quality snapshot — showing the proportion of records that were valid, converted, or excluded.
-- Summarize validity of birthdate entries after cleaning
WITH final_status AS (
SELECT
CASE
WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN 'Missing / Placeholder'
WHEN birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$' THEN 'Valid (YYYY-MM-DD)'
WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$' THEN 'Converted (DD-MM-YYYY)'
WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$' THEN 'Converted (YYYY/MM/DD)'
ELSE 'Other / Invalid'
END AS category
FROM public.visit_hospital
)
SELECT category, COUNT(*) AS record_count
FROM final_status
GROUP BY category
ORDER BY record_count DESC;

| category | record_count |
|---|---|
| Missing / Placeholder | 20 |
| Other / Invalid | 8 |
| Converted (YYYY/MM/DD) | 4 |
| Valid (YYYY-MM-DD) | 4 |
| Converted (DD-MM-YYYY) | 4 |
Listing the Invalid and Unknown Rows
WITH normalized AS (
SELECT
id_text,
birthdate_raw,
NULLIF(TRIM(birthdate_raw), '') AS raw_trim,
CASE
WHEN birthdate_raw ~ '^\s*\d{4}[-/]\d{2}[-/]\d{2}\s*$'
THEN REGEXP_REPLACE(TRIM(birthdate_raw), '/', '-', 'g')
WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$'
THEN TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD')
ELSE NULL
END AS ymd_text
FROM public.visit_hospital
),
parts AS (
SELECT
id_text, birthdate_raw, raw_trim, ymd_text,
CASE WHEN ymd_text ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(ymd_text,'-',1)::int END AS y,
CASE WHEN ymd_text ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(ymd_text,'-',2)::int END AS m,
CASE WHEN ymd_text ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(ymd_text,'-',3)::int END AS d
FROM normalized
),
validated AS (
SELECT
id_text, birthdate_raw, ymd_text, y, m, d,
(m BETWEEN 1 AND 12)
AND (d BETWEEN 1 AND EXTRACT(DAY FROM (date_trunc('month', make_date(y,m,1))
+ INTERVAL '1 month' - INTERVAL '1 day'))) AS is_valid
FROM parts
)
SELECT
id_text,
birthdate_raw,
ymd_text AS normalized_text,
y, m, d,
CASE
WHEN ymd_text IS NOT NULL AND NOT is_valid THEN 'Invalid (impossible date)'
WHEN ymd_text IS NULL THEN 'Unknown/Unparsed'
END AS reason
FROM validated
WHERE (ymd_text IS NOT NULL AND NOT is_valid)
   OR (ymd_text IS NULL
       -- exclude explicit placeholders:
       AND COALESCE(TRIM(birthdate_raw), '') NOT IN ('NA', '#NULL!', ''))
ORDER BY reason, id_text
LIMIT 20;

| id_text | birthdate_raw | normalized_text | y | m | d | reason |
|---|---|---|---|---|---|---|
| Home|Quiñ|1955-10-26 | 1984-13-24 | 1984-13-24 | 1984 | 13 | 24 | Invalid (impossible date) |
| Home|Quiñ|1955-10-26 | 1984-13-24 | 1984-13-24 | 1984 | 13 | 24 | Invalid (impossible date) |
| Home|Quiñ|1955-10-26 | 1984-13-24 | 1984-13-24 | 1984 | 13 | 24 | Invalid (impossible date) |
| Home|Quiñ|1955-10-26 | 1984-13-24 | 1984-13-24 | 1984 | 13 | 24 | Invalid (impossible date) |
Range Profiling of Cleaned Dates
After cleaning and conversion to valid dates, you can check for chronological plausibility (e.g., no future birthdates).
-- Profile chronological range of cleaned birthdates
WITH validated AS (
SELECT
TO_DATE(TRIM(birthdate_raw), 'YYYY-MM-DD') AS birthdate_final
FROM public.visit_hospital
WHERE birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$'
)
SELECT
MIN(birthdate_final) AS earliest_birth,
MAX(birthdate_final) AS latest_birth,
COUNT(*) AS total_valid
FROM validated;

| earliest_birth | latest_birth | total_valid |
|---|---|---|
| 2003-11-12 | 2003-11-12 | 4 |
Summary of Date Formatting
| Step | Technique | Purpose |
|---|---|---|
| Pattern profiling | ~ (regex match) + split_part() | Identify valid, invalid, or mixed-format entries |
| Safe reformatting | REPLACE() + TO_CHAR(TO_DATE(...)) | Convert raw input to ISO YYYY-MM-DD text safely |
| Range validation | Regex guard (0[1-9]\|1[0-2]) for month, etc. | Block out-of-range month/day before conversion |
| Safe conversion | CASE WHEN ... THEN TO_DATE() | Cast only confirmed valid formats to DATE |
| Quality summary | GROUP BY category | Provide audit view of data-quality categories |
| Range check | MIN() / MAX() | Confirm chronological plausibility of cleaned dates |
Date & Time Analysis
Profiling the Date of Birth Column
Start with an overview to check completeness and plausible ranges.
-- Profile birthdate data: completeness and range
SELECT
COUNT(*) AS total_records,
COUNT(birthdate) AS non_missing_birthdate,
MIN(birthdate) AS earliest_birth,
MAX(birthdate) AS latest_birth
FROM public.visit_hospital;

| total_records | non_missing_birthdate | earliest_birth | latest_birth |
|---|---|---|---|
| 40 | 24 | 1953-05-10 | 2005-04-25 |
This confirms the time coverage of your data (here, birthdates from 1953 to 2005) and helps detect potential outliers or missing values.
Calculating Births by Month and Year
-- Count births by month
SELECT
DATE_TRUNC('month', birthdate)::date AS birth_month,
COUNT(*) AS total_births
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY 1
ORDER BY birth_month;

| birth_month | total_births |
|---|---|
| 1953-05-01 | 4 |
| 1955-10-01 | 4 |
| 1966-06-01 | 4 |
| 1984-08-01 | 4 |
| 2003-11-01 | 4 |
| 2005-04-01 | 4 |
-- Births by Year (show only the year number)
SELECT
EXTRACT(YEAR FROM birthdate)::int AS birth_year,
COUNT(*) AS total_births
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY 1
ORDER BY birth_year;

| birth_year | total_births |
|---|---|
| 1953 | 4 |
| 1955 | 4 |
| 1966 | 4 |
| 1984 | 4 |
| 2003 | 4 |
| 2005 | 4 |
Calculating Current Age in Years
The AGE() function calculates an interval between two dates, while DATE_PART() extracts the number of years.
-- Compute current age in years for each individual
SELECT
id_text,
birthdate,
DATE_PART('year', AGE(CURRENT_DATE, birthdate)) AS age_years
FROM public.visit_hospital
ORDER BY age_years DESC
LIMIT 15;

| id_text | birthdate | age_years |
|---|---|---|
| Hatt|Nade|1986-07-04 | NA | NA |
| Anit|Sánc|1960-05-07 | NA | NA |
| Cars|Litt|2012-09-02 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Cars|Litt|2012-09-02 | NA | NA |
| Hatt|Nade|1986-07-04 | NA | NA |
AGE(CURRENT_DATE, birthdate) returns the elapsed interval since birth, and DATE_PART('year', …) extracts the number of full years. Note that rows with missing birthdates appear first in the output above because PostgreSQL sorts NULLs first under ORDER BY … DESC; append NULLS LAST, or filter with WHERE birthdate IS NOT NULL, to push them to the end.
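The distinction between completed years and a naive year difference is easy to see side by side; a minimal sketch on the same table:

```sql
-- Sketch: AGE() counts completed years; subtracting year numbers ignores
-- whether the birthday has already occurred this year.
SELECT
  birthdate,
  DATE_PART('year', AGE(CURRENT_DATE, birthdate))                AS completed_years,
  EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM birthdate) AS naive_year_diff
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
LIMIT 5;
```

For anyone born late in the year, naive_year_diff will read one year higher than completed_years until their birthday passes.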
Calculating Duration from Birthdate to a Reference Date
Interval arithmetic is often used to determine time to a fixed milestone (e.g., 18th or 65th birthday).
-- Time until 65th birthday for each person
SELECT
id_text,
birthdate,
(birthdate + INTERVAL '65 years') AS date_turns_65,
AGE((birthdate + INTERVAL '65 years'), CURRENT_DATE) AS time_until_65
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
ORDER BY date_turns_65
LIMIT 15;

| id_text | birthdate | date_turns_65 | time_until_65 |
|---|---|---|---|
| Tamb|Bali|1953-05-10 | 1953-05-10 | 2018-05-10 | -7 years -6 mons -5 days |
| Tamb|Bali|1953-05-10 | 1953-05-10 | 2018-05-10 | -7 years -6 mons -5 days |
| Tamb|Bali|1953-05-10 | 1953-05-10 | 2018-05-10 | -7 years -6 mons -5 days |
| Tamb|Bali|1953-05-10 | 1953-05-10 | 2018-05-10 | -7 years -6 mons -5 days |
| Home|Quiñ|1955-10-26 | 1955-10-26 | 2020-10-26 | -5 years -20 days |
| Home|Quiñ|1955-10-26 | 1955-10-26 | 2020-10-26 | -5 years -20 days |
| Home|Quiñ|1955-10-26 | 1955-10-26 | 2020-10-26 | -5 years -20 days |
| Home|Quiñ|1955-10-26 | 1955-10-26 | 2020-10-26 | -5 years -20 days |
| Mari|Lind|1966-06-23 | 1966-06-23 | 2031-06-23 | 5 years 7 mons 8 days |
| Mari|Lind|1966-06-23 | 1966-06-23 | 2031-06-23 | 5 years 7 mons 8 days |
By adding intervals to birthdate, you can project future milestones, calculate eligibility periods, or align with policy cutoffs (e.g., retirement age).
Detect Implausible or Anomalous Ages
Identifying outliers helps ensure that demographic insights are grounded in reality.
WITH ages AS (
SELECT
id_text,
DATE_PART('year', AGE(CURRENT_DATE, birthdate)) AS age_years
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
)
SELECT
COUNT(*) FILTER (WHERE age_years < 0) AS negative_age,
COUNT(*) FILTER (WHERE age_years > 100) AS unrealistic_age,
ROUND(AVG(age_years)::numeric, 1) AS avg_age
FROM ages;

| negative_age | unrealistic_age | avg_age |
|---|---|---|
| 0 | 0 | 47.3 |
This check ensures no impossible ages (negative or over 100 years) are present. Such records may indicate data-entry or unit conversion issues.
Temporal Trends by Birth Year or Decade
Aggregating by birth year or decade reveals how age distributions evolve over time.
-- Group individuals by birth year and decade
SELECT
FLOOR(EXTRACT(YEAR FROM birthdate) / 10) * 10 AS birth_decade,
ROUND(CAST(AVG(DATE_PART('year', AGE(CURRENT_DATE, birthdate))) AS NUMERIC), 1) AS avg_age,
COUNT(*) AS total_individuals
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY birth_decade
ORDER BY birth_decade;

| birth_decade | avg_age | total_individuals |
|---|---|---|
| 1950 | 71 | 8 |
| 1960 | 59 | 4 |
| 1980 | 41 | 4 |
| 2000 | 21 | 8 |
EXTRACT(YEAR …) pulls the numeric birth year.
Decade grouping (FLOOR(year / 10) * 10) is helpful for longitudinal or generational reporting.
Visualizing Age Distribution Over Time
-- Average current age by birth decade
SELECT
FLOOR(EXTRACT(YEAR FROM birthdate) / 10) * 10 AS birth_decade,
ROUND(CAST(AVG(DATE_PART('year', AGE(CURRENT_DATE, birthdate))) AS NUMERIC), 1) AS avg_age,
COUNT(*) AS total_individuals
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY birth_decade
ORDER BY birth_decade;

| birth_decade | avg_age | total_individuals |
|---|---|---|
| 1950 | 71 | 8 |
| 1960 | 59 | 4 |
| 1980 | 41 | 4 |
| 2000 | 21 | 8 |
This provides an age summary across decades — a compact representation of how age structure shifts over time.
Part VII — Descriptive Statistics & Tabulation
- Aggregations & GROUP BY; HAVING filters
- Ordered-set aggregates (percentiles)
- Crosstab/pivot with the tablefunc extension
Once the data is clean and structured, the next step is summarization and tabulation — transforming granular observations into interpretable statistics.
This section demonstrates descriptive techniques such as aggregation with GROUP BY, conditional filtering with HAVING, calculating percentiles, and reshaping summaries into cross-tabulated (pivoted) formats.
All examples below use the public.global_health_statistics dataset.
1️⃣ Aggregation with GROUP BY
Aggregations condense large datasets into concise summaries — totals, averages, and counts by meaningful categories. The GROUP BY clause in SQL is the equivalent of group_by() + summarise() in R.
In data analytics, this is how we turn millions of records into tables that inform decision-making.
In the example below, the query groups data by both country and disease, and computes three key metrics:
- The average number of people affected per record,
- The total number of people affected (overall disease burden),
- The number of records contributing to each group. The example below shows how
GROUP BYcan be used to generate aggregated insights across multiple dimensions, revealing where disease burdens are highest.
-- Average population affected by disease and country
SELECT
country,
disease_name,
ROUND(AVG(population_affected)::numeric, 0) AS avg_population_affected,
SUM(population_affected) AS total_population_affected,
COUNT(*) AS records_count
FROM public.global_health_statistics
GROUP BY country, disease_name
ORDER BY total_population_affected DESC
LIMIT 15;

| country | disease_name | avg_population_affected | total_population_affected | records_count |
|---|---|---|---|---|
| Germany | Tuberculosis | 994653 | 24866314 | 25 |
| Mexico | Ebola | 994470 | 24861757 | 25 |
| Turkey | Dengue | 994385 | 24859631 | 25 |
| Argentina | Ebola | 994191 | 24854765 | 25 |
| France | Hepatitis | 994065 | 24851614 | 25 |
| Argentina | Diabetes | 993911 | 24847772 | 25 |
| Mexico | Rabies | 993845 | 24846123 | 25 |
| South Korea | Parkinson’s Disease | 993740 | 24843491 | 25 |
| Turkey | HIV/AIDS | 993719 | 24842966 | 25 |
| Italy | Polio | 993676 | 24841908 | 25 |
Context: This type of aggregation could support a quick ranking of health priorities — for instance, identifying which diseases have historically affected the largest populations within each country. Using both AVG() and SUM() together allows you to distinguish between typical impact per observation and overall cumulative impact. The ORDER BY ensures that the most significant results (in terms of total affected population) appear at the top.
2️⃣ HAVING for Filtering Aggregated Results
After summarizing, it’s often useful to narrow down results to only those groups that meet specific conditions — for example, diseases with exceptionally high case counts. Unlike WHERE, which filters individual rows before aggregation, HAVING filters aggregated groups after summary calculations.
The example below shows how to isolate high-impact diseases affecting over one million people in total.
-- Show only diseases with more than 1 million people affected
SELECT
disease_name,
SUM(population_affected) AS total_population
FROM public.global_health_statistics
GROUP BY disease_name
HAVING SUM(population_affected) > 1000000
ORDER BY total_population DESC
LIMIT 10;

| disease_name | total_population |
|---|---|
| Polio | 495323206 |
| Tuberculosis | 495320883 |
| Parkinson’s Disease | 495304287 |
| Ebola | 495254480 |
| Hypertension | 495155070 |
| Cholera | 495153078 |
| Alzheimer’s Disease | 495140228 |
| Zika | 495121139 |
| COVID-19 | 495111737 |
| Cancer | 495100583 |
Here, HAVING ensures we only report diseases where the cumulative affected population exceeds 1 million. Context: This query is useful when performing threshold-based analysis — such as identifying key health challenges that exceed certain severity levels. It could easily be adapted for different analytical questions, such as finding diseases with a rapid increase over time or regions exceeding target incidence rates.
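To make the row-level versus group-level distinction concrete, a minimal sketch combining both clauses (column names as in the examples above; the 2010 cutoff is illustrative):

```sql
-- Sketch: WHERE trims rows before aggregation; HAVING trims groups after it.
SELECT disease_name, SUM(population_affected) AS total_population
FROM public.global_health_statistics
WHERE year_event >= 2010                     -- row filter, applied first
GROUP BY disease_name
HAVING SUM(population_affected) > 1000000    -- group filter, applied to the sums
ORDER BY total_population DESC;
```

Swapping the two clauses is a classic error: referencing an aggregate like SUM() in WHERE fails, because WHERE runs before the groups exist.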
3️⃣ Ordered-Set Aggregates: Percentiles & Medians
While aggregates like AVG() provide a measure of central tendency, percentiles and medians describe distribution and spread. They are especially useful when data contains outliers — for example, one extreme outbreak could distort the average, while the median gives a more typical view.
The example below shows how to compute quartiles (25th, 50th, 75th percentiles) of population affected for each disease.
-- Compute median and quartiles of affected population by disease
SELECT
disease_name,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY population_affected) AS q1,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY population_affected) AS median,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY population_affected) AS q3
FROM public.global_health_statistics
WHERE population_affected IS NOT NULL
GROUP BY disease_name
ORDER BY median DESC
LIMIT 10;

| disease_name | q1 | median | q3 |
|---|---|---|---|
| Zika | 986952.5 | 993871.0 | 997424.8 |
| COVID-19 | 986470.8 | 993712.0 | 996905.0 |
| Parkinson’s Disease | 986463.0 | 993632.5 | 997368.5 |
| Cholera | 987292.5 | 993503.5 | 997038.0 |
| Polio | 987037.5 | 993333.5 | 996737.0 |
| Measles | 985969.0 | 993319.5 | 997046.5 |
| Leprosy | 985762.0 | 993277.5 | 997048.2 |
| Diabetes | 987171.0 | 993161.0 | 997301.8 |
| Tuberculosis | 987507.0 | 993129.5 | 997328.0 |
| Ebola | 986789.0 | 993120.5 | 997368.8 |
Context: Percentiles help identify which diseases have the widest or most concentrated spread of population impact. For instance, a disease with a large gap between the 25th and 75th percentile likely affects different regions very unevenly — an insight valuable for targeted interventions.
4️⃣ Reshaping Data: Cross-Tabulation (Pivoting Data)
Data reshaping refers to transforming the structure of a dataset — switching between long and wide formats depending on analytical or reporting needs.
In long format, each observation occupies one row (e.g., each record represents a country–year–disease combination).
In wide format, a single row may represent a country or year, and multiple variables or categories become columns (e.g., diseases side by side).
Reshaping is a key part of preparing data for visualizations, dashboards, and comparative summaries.
While SQL is inherently row-oriented, PostgreSQL’s tablefunc extension and some creative joins make reshaping feasible directly within SQL.
These transformations are common in epidemiology, public health reporting, and data science workflows:
- Analysts prefer long format for modeling and statistical analysis.
- Decision-makers and stakeholders often prefer wide format for dashboards, comparisons, and quick scanning.
The following examples demonstrate both wide and long reshaping operations using the global_health_statistics dataset.
4.1: Reshaping to Wide Format
Pivoting data turns long-format tables into wide-format summaries, similar to Excel pivot tables or R’s pivot_wider(). This is particularly useful for comparing values side by side — such as annual totals per disease.
In PostgreSQL, pivoting can be done using the crosstab() function from the tablefunc extension.
Why this matters: stakeholders often ask for "diseases across columns by year." The example below pivots totals by year into a wide table for quick scanning.
-- Enable the extension (run once per database)
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
$$
SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
FROM public.global_health_statistics
GROUP BY year_event, disease_name
ORDER BY year_event, disease_name
$$
) AS ct(
year_event INT,
malaria NUMERIC,
dengue NUMERIC,
influenza NUMERIC,
covid19 NUMERIC
)
LIMIT 10;

| year_event | malaria | dengue | influenza | covid19 |
|---|---|---|---|---|
| 2000 | 19835840 | 19806950 | 19801964 | 19865735 |
| 2001 | 19794559 | 19817181 | 19824040 | 19822067 |
| 2002 | 19836984 | 19816082 | 19759866 | 19784016 |
| 2003 | 19773534 | 19704438 | 19764751 | 19803744 |
| 2004 | 19792759 | 19825219 | 19778123 | 19787008 |
| 2005 | 19850519 | 19787102 | 19870901 | 19746345 |
| 2006 | 19846542 | 19769208 | 19814068 | 19747205 |
| 2007 | 19847607 | 19821630 | 19766503 | 19760524 |
| 2008 | 19735264 | 19848897 | 19823537 | 19756148 |
| 2009 | 19718827 | 19763020 | 19770930 | 19802159 |
Context: This example shows how a long transactional dataset can be reshaped into a compact, comparative view where each disease becomes a column. Such output is ideal for visualizing trends or preparing data for dashboards (e.g., year-over-year change in disease prevalence).
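One caveat: the single-argument form of crosstab() fills the declared output columns positionally, so if the source query returns more disease categories than the four declared here, values can land under the wrong headings. A hedged sketch of the safer two-argument form, which pins each category to its column (the disease-name spellings are assumed to match the dataset):

```sql
-- Sketch: two-argument crosstab() matches categories to declared columns explicitly;
-- missing year/disease combinations become NULL instead of shifting other values.
SELECT *
FROM crosstab(
  $$ SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
     FROM public.global_health_statistics
     GROUP BY year_event, disease_name
     ORDER BY 1, 2 $$,
  $$ VALUES ('Malaria'), ('Dengue'), ('Influenza'), ('COVID-19') $$
) AS ct(
  year_event INT,
  malaria NUMERIC,
  dengue NUMERIC,
  influenza NUMERIC,
  covid19 NUMERIC
)
LIMIT 5;
```

The second query enumerates the categories in the exact order of the declared columns, which makes the pivot robust to extra or absent categories in the data.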
4.2: Reshaping to Long Format
Long format is often preferred for modeling, analytics, and machine learning pipelines because it’s tidy — every variable has its own column, and every observation has its own row. When you receive data in a wide structure (for example, one column per disease), converting it back to long format makes it easier to aggregate, filter, or compute metrics by category.
Why this matters: Analytical workflows (e.g., regression, forecasting, clustering) often expect a normalized long structure. This format aligns better with relational database principles and is easier to extend when new diseases or measures appear.
In PostgreSQL, this can be achieved with UNION ALL or the more compact UNNEST() approach.
SELECT year_event, disease_name, population_affected
FROM (
SELECT *
FROM crosstab(
$$
SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
FROM public.global_health_statistics
GROUP BY year_event, disease_name
ORDER BY year_event, disease_name
$$
) AS ct(
year_event INT,
malaria NUMERIC,
dengue NUMERIC,
influenza NUMERIC,
covid19 NUMERIC
)
) crosstab_result
CROSS JOIN LATERAL unnest(
ARRAY['malaria','dengue','influenza','covid19'],
ARRAY[malaria,dengue,influenza,covid19]
) AS t(disease_name, population_affected)
ORDER BY year_event, disease_name;

| year_event | disease_name | population_affected |
|---|---|---|
| 2000 | covid19 | 19865735 |
| 2000 | dengue | 19806950 |
| 2000 | influenza | 19801964 |
| 2000 | malaria | 19835840 |
| 2001 | covid19 | 19822067 |
| 2001 | dengue | 19817181 |
| 2001 | influenza | 19824040 |
| 2001 | malaria | 19794559 |
| 2002 | covid19 | 19784016 |
| 2002 | dengue | 19816082 |
- The first ARRAY[…] defines the disease names.
- The second ARRAY[…] holds the corresponding values from each column.
- UNNEST expands the two arrays in parallel into long rows.
The same result can be achieved with UNION ALL, which manually "unpivots" the wide data back into long form:
-- Example: manually reshaping a wide table back to long format
-- (assuming we already have a pivoted summary table)
WITH disease_wide AS (
SELECT *
FROM crosstab(
$$
SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
FROM public.global_health_statistics
GROUP BY year_event, disease_name
ORDER BY year_event, disease_name
$$
) AS ct(
year_event INT,
malaria NUMERIC,
dengue NUMERIC,
influenza NUMERIC,
covid19 NUMERIC
)
)
SELECT year_event, 'malaria' AS disease_name, malaria AS population_affected FROM disease_wide
UNION ALL
SELECT year_event, 'dengue', dengue FROM disease_wide
UNION ALL
SELECT year_event, 'influenza', influenza FROM disease_wide
UNION ALL
SELECT year_event, 'covid19', covid19 FROM disease_wide
ORDER BY year_event, disease_name
LIMIT 20;

| year_event | disease_name | population_affected |
|---|---|---|
| 2000 | covid19 | 19865735 |
| 2000 | dengue | 19806950 |
| 2000 | influenza | 19801964 |
| 2000 | malaria | 19835840 |
| 2001 | covid19 | 19822067 |
| 2001 | dengue | 19817181 |
| 2001 | influenza | 19824040 |
| 2001 | malaria | 19794559 |
| 2002 | covid19 | 19784016 |
| 2002 | dengue | 19816082 |
5️⃣ Multi-Level Summaries
When you want to track patterns over time, it’s helpful to summarize data at multiple hierarchical levels. The query below aggregates the total population affected by year and country, producing a panel-like view suitable for temporal trend analysis.
-- Yearly total affected population per country
SELECT
country,
year_event,
SUM(population_affected) AS total_affected
FROM public.global_health_statistics
GROUP BY country, year_event
ORDER BY country, year_event;

| country | year_event | total_affected |
|---|---|---|
| Argentina | 2000 | 19817875 |
| Argentina | 2001 | 19835072 |
| Argentina | 2002 | 19839406 |
| Argentina | 2003 | 19825531 |
| Argentina | 2004 | 19868546 |
| Argentina | 2005 | 19803811 |
| Argentina | 2006 | 19873224 |
| Argentina | 2007 | 19812365 |
| Argentina | 2008 | 19796552 |
| Argentina | 2009 | 19806823 |
Context: This example demonstrates how a simple GROUP BY over multiple dimensions can reveal longitudinal trends — such as whether a country’s total burden is increasing or decreasing over time. The resulting dataset could serve as the basis for line charts or heatmaps in further visualization steps.
🔎 Summary Table: Descriptive Statistics & Reshaping Techniques
| Technique | Purpose | SQL Example | When to Use |
|---|---|---|---|
| GROUP BY + aggregates (SUM, AVG, COUNT) | Collapse raw rows into grouped summaries | SELECT country, disease_name, SUM(population_affected) FROM ... GROUP BY country, disease_name; | Rank burdens by country/disease, create quick summaries |
| HAVING filters | Filter groups after aggregation | ... GROUP BY disease_name HAVING SUM(population_affected) > 1000000; | Identify only high-impact diseases or threshold exceedances |
| Ordered-set aggregates (PERCENTILE_CONT) | Compute medians, quartiles, or percentiles | SELECT disease_name, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY population_affected) FROM ... GROUP BY disease_name; | Understand distributions, reduce impact of outliers |
| crosstab() (wide pivot) | Reshape long data into wide format (like Excel pivot tables) | SELECT * FROM crosstab($$SELECT year_event, disease_name, SUM(population_affected) ...$$) AS ct(...); | Dashboards, year-over-year comparisons, stakeholder summaries |
| UNNEST() / UNION ALL (long unpivot) | Convert wide back to long/tidy format | SELECT year_event, disease_name, population_affected FROM wide CROSS JOIN LATERAL unnest(...); | Feed into modeling, visualization, machine learning |
| Multi-level summaries | Aggregate across multiple hierarchical dimensions | SELECT country, year_event, SUM(population_affected) FROM ... GROUP BY country, year_event; | Trend analysis across time and geography |
Key Insight:
- Wide format is great for comparative dashboards, stakeholder reporting, and scanning.
- Long format is better for tidy analysis, visualisation, and interoperability with tools like R or Python.
- SQL’s flexibility (GROUP BY, HAVING, PERCENTILE_CONT, crosstab, UNNEST) allows analysts to move fluidly between exploratory statistics and presentation-ready summaries.
Key Takeaways
Descriptive SQL transforms raw transactional data into interpretable insights through summarization and restructuring.
The ability to move fluidly between long and wide formats reflects practical analytical flexibility — catering to both technical and decision-maker audiences.
Part VIII — Performance Basics
Database performance tuning is an essential skill when working with large datasets such as public.global_health_statistics. Even seemingly simple queries can become very slow if the database needs to scan millions of rows. This section demonstrates three core performance techniques — indexing, reading query plans, and simple tuning habits — using examples from the dataset. Performance optimization in SQL typically involves:
- Creating the right indexes
- Understanding how queries are executed with EXPLAIN ANALYZE
- Applying simple tuning strategies to improve speed
1. Indexing Strategies
SQL databases use B-tree indexes by default. They work like a sorted lookup structure that makes finding rows faster.
When to use indexes:
- On filtering columns (WHERE)
- On grouping keys (GROUP BY)
- On ordering columns (ORDER BY)
- On join keys between tables
Indexes act like a book’s table of contents: instead of flipping through every page, the database can jump directly to the rows it needs. For instance:
-- Without an index: PostgreSQL will likely do a sequential scan
EXPLAIN ANALYZE
SELECT *
FROM public.global_health_statistics
WHERE country = 'Australia';

| QUERY PLAN |
|---|
| Bitmap Heap Scan on global_health_statistics (cost=8.16..1796.94 rows=500 width=168) (actual time=0.143..0.442 rows=500 loops=1) |
| Recheck Cond: (country = ‘Australia’::text) |
| Heap Blocks: exact=497 |
| -> Bitmap Index Scan on idx_ghs_country (cost=0.00..8.04 rows=500 width=0) (actual time=0.091..0.091 rows=500 loops=1) |
| Index Cond: (country = ‘Australia’::text) |
| Planning Time: 0.077 ms |
| Execution Time: 0.467 ms |
Indexes speed up queries by allowing the database to look up rows efficiently rather than scanning the entire table. (In the plan above, idx_ghs_country already existed when this document was rendered, so PostgreSQL chose a Bitmap Index Scan instead of the sequential scan you would see on a freshly loaded table.)
CREATE INDEX IF NOT EXISTS idx_ghs_country ON public.global_health_statistics (country);
-- Re-run the query
EXPLAIN ANALYZE
SELECT *
FROM public.global_health_statistics
WHERE country = 'Australia';
With the index in place, the plan switches to an index-based scan, reducing I/O dramatically. On large datasets the savings can amount to seconds or minutes; in this small example, execution time dropped from roughly 2.9 ms to 0.88 ms (see the screenshot below).
The following screenshot shows the execution time difference observed before and after adding an index:
2. Reading Query Plans with EXPLAIN ANALYZE
Indexes are only useful if the database actually uses them. That’s where EXPLAIN ANALYZE comes in. It shows the chosen plan and the actual execution time.
-- Create index on year_event
CREATE INDEX IF NOT EXISTS idx_ghs_year_event ON public.global_health_statistics (year_event);
-- Re-run the query to see the (potentially) faster plan
EXPLAIN ANALYZE
SELECT year_event, COUNT(*) AS n_records
FROM public.global_health_statistics
GROUP BY year_event
ORDER BY year_event;
The output explains whether PostgreSQL used sorting, hashing, or index scans. If grouping by year is frequent, adding an index on year may shift the plan from expensive full sorts to more efficient operations.
3. Simple Tuning Tips
Even without adding new indexes, developers can make queries run faster by following small habits:
a) Predicates: Make Them SARGable
SARGable = Search ARGument-able. Avoid wrapping the filtered column in a function or expression; doing so prevents the planner from using an index on that column.

| ❌ Non-SARGable | Good (SARGable) |
|---|---|
| WHERE ROUND(mortality_pct, 1) = 5.0 | WHERE mortality_pct = 5.0 |
| WHERE year_event + 0 = 2020 | WHERE year_event = 2020 |
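You can verify SARGability with EXPLAIN; a sketch, assuming the idx_ghs_year_event index created earlier exists:

```sql
-- Sketch: an expression on the column blocks index use
EXPLAIN SELECT * FROM public.global_health_statistics
WHERE year_event + 0 = 2020;   -- planner cannot match this to the index

EXPLAIN SELECT * FROM public.global_health_statistics
WHERE year_event = 2020;       -- bare column: idx_ghs_year_event becomes usable
```

Comparing the two plans (sequential scan versus an index or bitmap scan) makes the cost of the wrapped predicate visible.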
b) Projections: Select Only What You Need
-- Avoid pulling every column:
-- SELECT * FROM public.global_health_statistics;
SELECT country, year_event, disease_name, mortality_pct
FROM public.global_health_statistics
LIMIT 20;

| country | year_event | disease_name | mortality_pct |
|---|---|---|---|
| Nigeria | 2013 | Asthma | 9.15 |
| Turkey | 2010 | Tuberculosis | 2.75 |
| Argentina | 2006 | Ebola | 3.16 |
| Saudi Arabia | 2013 | Hypertension | 5.57 |
| Nigeria | 2004 | COVID-19 | 8.77 |
| Russia | 2012 | Parkinson’s Disease | 6.85 |
| Saudi Arabia | 2009 | Cancer | 1.58 |
| Russia | 2009 | Ebola | 4.46 |
| Nigeria | 2022 | Polio | 1.13 |
| India | 2016 | Leprosy | 8.48 |
c) Ordering: Use Index Support
If you frequently order by columns like population_affected, consider an index.
CREATE INDEX IF NOT EXISTS idx_population_affected
ON public.global_health_statistics (population_affected);
SELECT country, population_affected
FROM public.global_health_statistics
ORDER BY population_affected DESC
LIMIT 15;
Conclusion
With even these basic optimizations, you can significantly improve query performance:
Indexing frequently used filter, grouping, and ordering columns (year_event, country, disease_name, population_affected).
Reading query plans with EXPLAIN ANALYZE to understand execution.
Tuning queries by making predicates SARGable, projecting only needed columns, and leveraging indexes for ordering.
These practices ensure your queries on global_health_statistics remain efficient even as the dataset grows.
Part IX — Security, Backup & Sharing
Ensuring data integrity and accessibility means thinking beyond querying. This section covers how to:
- Secure access with roles & privileges
- Create backups and restore databases
- Export and share query results for reporting
1. Roles & Privileges — Granting Read-Only Access to Learners
PostgreSQL supports fine-grained control over who can connect to databases and what they can do inside.
Give Learners Read-Only Access to Global Health Data
-- Identify the database name
SELECT datname FROM pg_database;
A. Create the Role for the Learner
DO $$
BEGIN
IF NOT EXISTS (
SELECT FROM pg_roles WHERE rolname = 'learner_user'
) THEN
CREATE ROLE learner_user WITH LOGIN PASSWORD 'SafePass123';
END IF;
END $$;

B. Grant Database Access
-- Allow learner_user to connect to the specific database eg SQL_Data_Manipulation database
GRANT CONNECT ON DATABASE "SQL_Data_Manipulation" TO learner_user;

C. Grant Schema-Level Access
-- Allow the user to access objects in the 'public' schema
GRANT USAGE ON SCHEMA public TO learner_user;

D. Grant Read Access to Specific Tables
-- Allow SELECT queries only
GRANT SELECT ON public.global_health_statistics TO learner_user;

Now, learner_user can log in, browse the schema, and query the dataset — but cannot modify anything.
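A quick sanity check, assuming a local server, is to connect as the new role and try both a read and a write:

```sql
-- From a shell: psql -U learner_user -d SQL_Data_Manipulation
-- A read succeeds:
SELECT COUNT(*) FROM public.global_health_statistics;

-- A write is rejected with "permission denied for table
-- global_health_statistics":
UPDATE public.global_health_statistics SET mortality_pct = 0;
```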
Create Multiple Learners with a Group Role
Instead of managing access per individual, create a shared group role and assign learners to it.
-- Create the role group
DO $$
BEGIN
IF NOT EXISTS (
SELECT FROM pg_roles WHERE rolname = 'learners_group'
) THEN
CREATE ROLE learners_group;
END IF;
END $$;

-- Create individual users
DO $$
BEGIN
IF NOT EXISTS (
SELECT FROM pg_roles WHERE rolname = 'learner_ali'
) THEN
CREATE ROLE learner_ali WITH LOGIN PASSWORD 'Ali123';
END IF;
END $$;

DO $$
BEGIN
IF NOT EXISTS (
SELECT FROM pg_roles WHERE rolname = 'learner_sara'
) THEN
CREATE ROLE learner_sara WITH LOGIN PASSWORD 'Sara123';
END IF;
END $$;

-- Add both learners to the group
GRANT learners_group TO learner_ali;
GRANT learners_group TO learner_sara;

-- Grant group access to the database
GRANT CONNECT ON DATABASE "SQL_Data_Manipulation" TO learners_group;
GRANT USAGE ON SCHEMA public TO learners_group;
GRANT SELECT ON public.global_health_statistics TO learners_group;

Audit Who Has Access
SELECT rolname
FROM pg_roles
WHERE has_database_privilege(rolname, 'SQL_Data_Manipulation', 'CONNECT');

| rolname |
|---|
| pg_database_owner |
| pg_read_all_data |
| pg_write_all_data |
| pg_monitor |
| pg_read_all_settings |
| pg_read_all_stats |
| pg_stat_scan_tables |
| pg_read_server_files |
| pg_write_server_files |
| pg_execute_server_program |
To see which roles have SELECT access on global_health_statistics:
SELECT grantee, privilege_type
FROM information_schema.role_table_grants
WHERE table_name = 'global_health_statistics';

| grantee | privilege_type |
|---|---|
| postgres | TRIGGER |
| postgres | REFERENCES |
| postgres | TRUNCATE |
| postgres | DELETE |
| postgres | UPDATE |
| postgres | SELECT |
| postgres | INSERT |
| learner_user | SELECT |
| learners_group | SELECT |
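The payoff of the group role shows up later: enrolling or removing a learner is a single GRANT or REVOKE, with no per-table work (learner_omar below is a hypothetical example user):

```sql
-- Enroll a learner created later; the group's table grants apply immediately
CREATE ROLE learner_omar WITH LOGIN PASSWORD 'Omar123';
GRANT learners_group TO learner_omar;

-- Withdraw access without touching any table-level grants
REVOKE learners_group FROM learner_omar;
```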
2. Backups and Restores
🔸 Using pgAdmin (GUI)
To Backup:
Right-click on SQL_Data_Manipulation
Click Backup
Choose format: Custom
Save the file (.backup)
To Restore:
Create a new empty database (e.g., SQL_Health_Restore)
Right-click → Restore
Load the .backup file
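The same backup and restore can be scripted from a command prompt with the CLI tools named in the summary below; this is a sketch assuming pg_dump, createdb, and pg_restore are on your PATH and the database name matches this document's examples:

```shell
# Custom-format backup (equivalent of pgAdmin's Backup dialog)
pg_dump -U postgres -F c -f SQL_Data_Manipulation.backup SQL_Data_Manipulation

# Restore into a fresh, empty database
createdb -U postgres SQL_Health_Restore
pg_restore -U postgres -d SQL_Health_Restore SQL_Data_Manipulation.backup
```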
3. Exporting Results and Reporting with Quarto
A. Export to CSV using psql
Note that server-side COPY ... TO writes the file on the database server and requires file-write privileges there; from psql, the client-side \copy variant takes the same query and writes to a path on your own machine instead.
COPY (
SELECT disease_name, year_event, AVG(mortality_pct) AS avg_mortality
FROM public.global_health_statistics
GROUP BY disease_name, year_event
) TO 'C:\\temp\\disease_mortality.csv' CSV HEADER;

B. Render a Quarto Table or Chart
SELECT
disease_name,
ROUND(AVG(mortality_pct)::numeric, 2) AS avg_mortality
FROM public.global_health_statistics
GROUP BY disease_name
ORDER BY avg_mortality DESC
LIMIT 10;

| disease_name | avg_mortality |
|---|---|
| Malaria | 5.20 |
| Ebola | 5.09 |
| Cholera | 5.08 |
| Tuberculosis | 5.06 |
| Parkinson’s Disease | 5.06 |
| COVID-19 | 5.05 |
| Hepatitis | 5.05 |
| Diabetes | 5.04 |
| Influenza | 5.03 |
| Measles | 5.02 |
library(readr)  # provides read_csv()
df <- read_csv("C:/temp/disease_mortality.csv")

Rows: 500 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): disease_name
dbl (2): year_event, avg_mortality
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(ggplot2)  # attach in case an earlier chunk has not loaded it
ggplot(df, aes(x = reorder(disease_name, avg_mortality), y = avg_mortality)) +
geom_col(fill = "#0077cc") +
coord_flip() +
labs(title = "Top 10 Diseases by Average Mortality",
x = "Disease", y = "Mortality %")

Database Administration & Reporting Tasks
| Task | Tool | Notes |
|---|---|---|
| Grant DB Access | SQL | GRANT CONNECT, GRANT SELECT |
| Bulk Learner Setup | SQL + Group Role | Use learners_group |
| Backup | pgAdmin / pg_dump | Choose Custom format |
| Restore | pgAdmin / pg_restore | Create empty DB first |
| Export | \copy in psql | Save results as CSV |
| Reporting | Quarto + ggplot | Use dynamic SQL chunks |
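The "dynamic SQL chunks" approach in the last row avoids the CSV round-trip entirely: since the setup chunk at the top of this document already creates `con` with DBI, an R chunk can pull the aggregate straight from PostgreSQL. A sketch, assuming `con` is still open and ggplot2 is loaded:

```r
# Pull the top-10 aggregate directly through the existing DBI connection
df <- DBI::dbGetQuery(con, "
  SELECT disease_name,
         ROUND(AVG(mortality_pct)::numeric, 2) AS avg_mortality
  FROM public.global_health_statistics
  GROUP BY disease_name
  ORDER BY avg_mortality DESC
  LIMIT 10;
")

ggplot(df, aes(x = reorder(disease_name, avg_mortality), y = avg_mortality)) +
  geom_col(fill = "#0077cc") +
  coord_flip() +
  labs(title = "Top 10 Diseases by Average Mortality",
       x = "Disease", y = "Mortality %")
```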
By combining PostgreSQL’s security model, backup features, and Quarto’s reporting power, you’re equipped to build a secure, scalable, and sharable data analysis environment.
📝 Summary and Conclusion
This presentation provided an end-to-end journey through PostgreSQL on Windows, covering everything from installation to advanced analytics and reporting.
We began with the foundations of SQL (ACID principles, PostgreSQL ecosystem, pgAdmin/psql), then demonstrated data loading and exploration. From there, we explored both core wrangling (SELECT, WHERE, JOIN, CASE) and advanced techniques (CTEs, window functions, deduplication, upserts).
The middle sections focused on descriptive statistics (aggregation, percentiles, crosstabs) and data quality (missing data handling, casting/cleaning, working with dates).
Next, we examined performance basics (indexing, query plans, tuning practices) to ensure queries run efficiently even on large health datasets.
Finally, we addressed the critical topics of security, backup, and sharing, demonstrating how to manage roles and privileges, perform reliable backups, and integrate results with Quarto for transparent reporting.
🎯 Key Takeaways
- SQL is more than queries — it is a full environment for storing, securing, analyzing, and sharing data.
- PostgreSQL + pgAdmin provide a robust platform for analytics work.
- Integration with R and Quarto allows reproducible, professional-grade reporting.
- Data governance practices (indexes, roles, backups) are as important as analysis techniques.
By combining technical SQL skills with performance, governance, and reproducibility, analysts can deliver results that are accurate, efficient, and trusted.
📚 References & Learning Resources
For further study and practice, here are recommended books and websites.
Books
- Beaulieu, A. (2020). Learning SQL: Generate, manipulate, and retrieve data (3rd ed.). O’Reilly Media. ISBN: 978-1492057611
- Obe, R. O., & Hsu, L. S. (2017). PostgreSQL: Up and running: A practical guide to the advanced open source database (3rd ed.). O’Reilly Media. ISBN: 978-1491963418
- DeBarros, A. (2018). Practical SQL: A beginner’s guide to storytelling with data. No Starch Press. ISBN: 978-1593278274
Websites & Documentation
- PostgreSQL Official Documentation
- Postgres Tutorial
- Mode Analytics SQL Tutorial
- RStudio Quarto Documentation