# DBI connection created once and reused by SQL chunks
library(DBI)
library(RPostgres)
con <- dbConnect(
RPostgres::Postgres(),
dbname = "SQL_Data_Manipulation",
host = "localhost", port = 5432,
user = "postgres", password = Sys.getenv("PGPASSWORD")
)

🧰 Objective and Contents of the Presentation
This presentation provides a focused, end-to-end path for working with PostgreSQL on Windows—from installation to analysis and reporting. It begins by installing PostgreSQL with the EnterpriseDB bundle, registering the server in pgAdmin, and using StackBuilder for optional tools. It then shows how to create a new database, import data (e.g., .csv), and perform core SQL data manipulation (filtering, grouping, joins, window functions). Finally, it demonstrates how to connect SQL with R via Quarto/R Markdown to generate reproducible reports in HTML/PDF that combine narrative, code, and results.
🧭 Presentation Roadmap
The flow is intentional: begin with foundations (Part I), connect and publish reproducible outputs (Part II), then load and explore data (Part III). After that, move through core and advanced wrangling (Parts IV–V), data quality and types (Part VI), descriptive statistics (Part VII), performance tuning (Part VIII), and finally security, backup, and sharing (Part IX).
Part I — Introduction and Installing SQL
- What is SQL & why it matters (ACID, integrity, performance)
- PostgreSQL overview & tooling (pgAdmin, psql)
Part II — Connecting with R, R Markdown, Quarto
- Connecting from R/Quarto; running and saving scripts; comments; shortcuts
Part III — Data Loading & Exploration
- Table creation (DDL) & data types
- CSV import: `COPY`/`\copy` and permissions
- Basic exploration: previews, counts, distincts, NULL checks
Part IV — Wrangling Core (SELECT / WHERE / CASE / JOIN)
- Selecting & aliasing; computed columns
- Filtering, sorting, limiting
- Conditional logic with `CASE`, `COALESCE`, `NULLIF`
- Joins (INNER/LEFT/RIGHT/FULL) and practical patterns
- Appending data
Part V — Advanced Wrangling
- CTEs (`WITH`) for readable pipelines
- Window functions (ranking, partitions, moving calculations)
- Deduplication with `ROW_NUMBER()`; upserts overview
Part VI — Data Quality, Types & Dates
- Missing data strategies (fill, filter, profile)
- Casting & cleaning (regex, numeric/text conversions)
- Date/time basics and analysis
Part VII — Descriptive Statistics & Tabulation
- Aggregations & `GROUP BY`; `HAVING` filters
- Ordered-set aggregates (percentiles)
- Crosstab/pivot with the `tablefunc` extension
Part VIII — Performance Basics
- Indexing strategies (B-tree on filters/group keys)
- Reading query plans with `EXPLAIN ANALYZE`
- Simple tuning tips (predicates, projections, ordering)
Part IX — Security, Backup & Sharing
- Roles & privileges (grant read-only to learners)
- Backups: pgAdmin / `pg_dump` & restores
- Exporting results (CSV/HTML) & reporting with Quarto
Part I — Introduction: Installing, RDBMS, operations and toolkit
This section sets the context: what SQL is, when a relational database is the right tool, and how ACID properties (Atomicity: all-or-nothing transactions; Consistency: always obeys constraints; Isolation: concurrent work behaves as if run alone; Durability: commits persist even after failures) and well-designed schemas support reliable analytics at scale. It briefly compares mainstream RDBMS options. Then I walk through a clean Windows install using the EnterpriseDB bundle, registering the server in pgAdmin and verifying connectivity before any data work. In short, the hands-on toolkit centers on SELECT … FROM with WHERE filters, ORDER BY/LIMIT shaping, and DISTINCT/aliases; aggregations (COUNT/SUM/AVG) with GROUP BY/HAVING; joins to combine tables; and a few essentials—COALESCE/CASE, type casts (::), and common string/date helpers.
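To make that toolkit concrete before the PostgreSQL walkthrough, here is a minimal, self-contained sketch using Python's built-in `sqlite3` module (the table name and rows are invented for illustration); the same SELECT/WHERE/GROUP BY/HAVING pattern runs unchanged in PostgreSQL.

```python
import sqlite3

# Tiny in-memory stand-in for the health-statistics table (illustrative only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (country TEXT, year_event INT, dalys REAL)")
con.executemany(
    "INSERT INTO ghs VALUES (?, ?, ?)",
    [("Australia", 2019, 120.0), ("Australia", 2021, 80.0),
     ("France", 2021, None), ("France", 2022, 50.0)],
)

# The core toolkit in one query: WHERE filtering, COALESCE for NULLs,
# GROUP BY aggregation, HAVING on the aggregate, ORDER BY shaping.
rows = con.execute("""
    SELECT country, SUM(COALESCE(dalys, 0)) AS total_dalys
    FROM ghs
    WHERE year_event >= 2019
    GROUP BY country
    HAVING SUM(COALESCE(dalys, 0)) > 0
    ORDER BY total_dalys DESC
""").fetchall()
print(rows)  # [('Australia', 200.0), ('France', 50.0)]
```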
🧭 Relational Database Management Systems (RDBMS) Landscape
Relational Database Management Systems (RDBMS) are the engines that store and query structured, table-shaped data using SQL. They matter because they enforce data integrity (ACID transactions), power reporting and analytics, and keep apps reliable under load. Your choice of RDBMS impacts performance, features you can lean on, skills availability in the team, and ongoing licensing/hosting costs.
- PostgreSQL — Open-source, strong data types, constraints, JSON, CTEs (Common Table Expression), window functions, and extensions (e.g., PostGIS). Excellent for analytics + app backends.
- MySQL / MariaDB — Common in web stacks and hosting; straightforward and lightweight. Great for websites and SaaS backends.
- Microsoft SQL Server — Popular in enterprise settings (Azure/Windows shops). Tight integration with Power BI and the MS ecosystem.
- Oracle Database — Feature-rich enterprise workhorse used by government/finance; heavy on licensing and administration.
- Managed Cloud (DBaaS) — AWS RDS / Azure / Google Cloud SQL hosting PostgreSQL/MySQL/SQL Server with patching, backups, and scaling handled for you.
This project's choice — PostgreSQL:
I used PostgreSQL for this project because it offers modern SQL features without licence fees, runs well on Windows and all major clouds, and scales from local development to production. It has great community support and extensions when you need them.
🔗 Downloading and Installing SQL
Download the PostgreSQL installer from:
👉 https://www.enterprisedb.com/downloads/postgres-postgresql-downloads
Select the Windows version (e.g., x86-64). This installer is bundled with:
- PostgreSQL Server
- pgAdmin
- StackBuilder
- Command-line tools like `psql`
Bundling everything in one installer saves setup headaches later on.
📥 Installation Steps (Windows)
1. Run the Installer
   Double-click the .exe file to launch the setup wizard.
2. Select Components
   Leave all defaults selected:
   - ✅ PostgreSQL Server
   - ✅ pgAdmin
   - ✅ Command Line Tools
   - 🔄 StackBuilder (highly recommended)
3. Choose Installation Directory
   Stick with `C:\Program Files\PostgreSQL\<version>` unless you have a good reason to move it.
4. Set a Password for postgres
   This password secures the default admin account.
   💡 Tip: store it safely — you’ll need it to log into pgAdmin later.
5. Keep the Default Port
   Port 5432 is standard for PostgreSQL. Change it only if you know you have a conflict.
6. Accept the Locale Settings
   The default system locale works fine for most projects. You can later create databases with a specific locale if required.
7. Finish the Installation
   After confirming settings, the installer will complete the setup and initialise the database service.
First Launch: Setting Up pgAdmin
Launch pgAdmin from the Start Menu.
Click “Add New Server”
You’ll be prompted to register your PostgreSQL server so pgAdmin can manage it.
Configure the Connection
- General tab:
  - Name: Local PostgreSQL
- Connection tab:
  - Host: localhost
  - Port: 5432
  - Username: postgres
  - Password: (your password from earlier)
  - ✅ Check Save Password so you’re not prompted every time.
Click Save
You’ll now see your server listed in the left pane. Expand it to view databases, schemas, tables, and more.
- If you see two entries (e.g., Local PostgreSQL and PostgreSQL 17), they’re just pgAdmin shortcuts. Keep the auto-detected one and feel free to delete the duplicate one (e.g., localhost).
- You’ll reconnect each time you reopen pgAdmin (right-click server → Connect). If the password isn’t remembered, re-enter it and tick Save Password.
Verifying the Connection
- Right-click the server → Connect
- Expand Databases > postgres > Schemas > public
- This confirms your PostgreSQL server is active and accessible.
You’re now ready to manage your databases through the GUI.
🔌 Using StackBuilder to Install Extensions
- Select your PostgreSQL instance (e.g., PostgreSQL 15 on port 5432)
- Click Next
- You’ll see several categories. Here’s a guide:
What is recommended:
| Category | Description |
|---|---|
| Add-ons, tools and utilities | Tools like pgAgent, pgBackRest, or adminpack — helpful for backups, scheduling, and server-side utilities. |
| Database Drivers | ODBC, JDBC, and .NET drivers — install if you’re planning to connect PostgreSQL from BI tools or other languages like Java or R. |
Optional:
| Category | Use only if… |
|---|---|
| Spatial Extensions | You need GIS support — this includes PostGIS. |
| Web Development | You’re working on a full PHP/PostgreSQL stack (less common). |
| Database Server | Already installed via initial setup — no need to repeat. |
| Registration-required | Mostly trialware; skip unless testing something specific. |
Once selected, follow the prompts to download and install the tools/extensions you need.
- Reconnect behaviour: Each time you reopen pgAdmin, right-click the server → Connect (password won’t be prompted if saved).
If you see `fe_sendauth: no password supplied`, open the server properties → Connection tab → re-enter the password and tick Save Password if required.
SQL Essential Functions and Operations — Overview
This section offers an overview of essential SQL functions and operations, focusing on their use in data manipulation. We’ll explore core concepts that allow you to retrieve, filter, sort, and aggregate data, which are fundamental to working with any dataset. Later, these principles will be demonstrated using the public.global_health_statistics dataset within the SQL_Data_Manipulation database.
- Querying Basics: Use `SELECT ... FROM ...` to retrieve columns and rows. Add `WHERE` to filter, `ORDER BY` to sort, `LIMIT` to cap rows, and `DISTINCT` to remove duplicates. Column and table aliases keep queries readable.
- Operators & Predicates: Core building blocks for conditions: comparison (`=`, `<>`, `<`, `<=`, `>`, `>=`), logical (`AND`, `OR`, `NOT`), membership (`IN`), ranges (`BETWEEN`), pattern match (`LIKE`, `ILIKE`), null checks (`IS NULL`, `IS NOT NULL`), and regex (`~`, `~*`) in PostgreSQL.
- Identifiers, Aliases, Schemas: Qualify objects as `schema.table` (e.g., `public.global_health_statistics`). Use `AS` for aliases (e.g., `ghs`). Strings use single quotes `'...'`; double quotes preserve case/characters in identifiers; end statements with `;`. Comments: `-- line` and `/* block */`.
- Aggregation & Grouping: Aggregate with `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`. Use `GROUP BY` to define groups and `HAVING` to filter aggregated results.
- Joining Data: Combine tables on keys. Typical joins: INNER (matches), LEFT (preserve left), RIGHT, FULL, and CROSS. Prefer explicit `JOIN ... ON ...` with clear key conditions.
- Subqueries & CTEs: Nest queries inside `WHERE`/`FROM` or use CTEs with `WITH`. CTEs improve readability and can be chained; PostgreSQL also supports `WITH RECURSIVE` for hierarchical problems.
- Set Operations: Stack compatible result sets vertically: `UNION` (dedup), `UNION ALL` (no dedup), `INTERSECT` (overlap), `EXCEPT` (left minus right).
- Window Functions: Compute ranks, running totals, and moving averages without collapsing rows: `... OVER (PARTITION BY ... ORDER BY ...)`. Common helpers: `ROW_NUMBER`, `RANK`, `LAG`, `LEAD`, `SUM`/`AVG OVER`.
- Strings, Numbers, Dates: Strings: `UPPER`, `LOWER`, `TRIM`, `LENGTH`, `REPLACE`, `SUBSTRING`, concatenation `||`. Numbers: `ROUND`, `CEIL`, `FLOOR`, safe division with `NULLIF`. Dates: `NOW`, `EXTRACT`, `DATE_TRUNC`, `MAKE_DATE`. Cast types with `::` or `CAST(...)`.
- Conditional & NULL Handling: `CASE` builds conditional expressions; `COALESCE` picks the first non-NULL value; `NULLIF(a,b)` returns `NULL` if `a = b` (useful for safe division).
- DDL (Data Definition Language), Structures & Indexes: Define and evolve schema: `CREATE`/`ALTER`/`DROP TABLE`, constraints (`PRIMARY KEY`, `UNIQUE`, `NOT NULL`, `CHECK`), and `CREATE INDEX` to speed up common filters/joins.
- Transactions & Security: Wrap changes in `BEGIN ... COMMIT` (or `ROLLBACK`). Use `SAVEPOINT` for partial rollbacks. Control access with `GRANT`/`REVOKE`.
- Loading & Export: Bulk load/export CSV using `COPY` (server-side) or `\copy` (psql client-side). On Windows, prefer forward slashes (`C:/...`), and if permissions fail, move files to `C:/temp/`.
- Introspection & “Help”: Explore metadata via `information_schema` (tables/columns) or psql meta-commands (`\dt`, `\d ghs`, `\dn`, `\l`). `SHOW search_path;` reveals the active schema order.
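Two of these building blocks, CTEs and window functions, are easiest to grasp in miniature. A hedged sketch (invented mini-table; Python's built-in `sqlite3` standing in for a real server, since SQLite also supports `WITH` and `OVER`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (country TEXT, year_event INT, incidence REAL)")
con.executemany("INSERT INTO ghs VALUES (?, ?, ?)", [
    ("Australia", 2020, 2.0), ("Australia", 2021, 3.0),
    ("France", 2020, 5.0), ("France", 2021, 4.0),
])

# A CTE names an intermediate result; LAG compares each year to the prior one
# within each country's partition.
rows = con.execute("""
    WITH recent AS (
        SELECT country, year_event, incidence
        FROM ghs
        WHERE year_event >= 2020
    )
    SELECT country, year_event,
           incidence - LAG(incidence) OVER (
               PARTITION BY country ORDER BY year_event
           ) AS yoy_change
    FROM recent
    ORDER BY country, year_event
""").fetchall()
print(rows)
# [('Australia', 2020, None), ('Australia', 2021, 1.0),
#  ('France', 2020, None), ('France', 2021, -1.0)]
```

The first row of each partition has no prior row, so `LAG` yields NULL there.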
📋 Summary Table — Operators, Functions & Patterns
Abbreviation used below: `ghs` = `public.global_health_statistics`
| Operator / Function / Pattern | What it does | Tiny example (PostgreSQL) |
|---|---|---|
| `SELECT ... FROM` | Retrieve columns/rows | `SELECT country, year_event FROM ghs;` |
| `WHERE` | Filter rows | `... FROM ghs WHERE year_event >= 2018;` |
| `ORDER BY` | Sort output | `... ORDER BY incidence DESC;` |
| `LIMIT` / `OFFSET` | Cap rows / paginate | `... ORDER BY dalys DESC LIMIT 10;` |
| `DISTINCT` | Remove duplicates | `SELECT DISTINCT disease_category FROM ghs;` |
| `AS` (aliases) | Rename columns/tables | `SELECT dalys AS total_dalys FROM ghs;` |
| Comparison ops | `=, <>, <, <=, >, >=` | `WHERE country = 'Australia';` |
| Logical ops | `AND, OR, NOT` | `WHERE year_event >= 2015 AND dalys > 0;` |
| Membership | `IN (...)` | `WHERE country IN ('Australia','USA');` |
| Range | `BETWEEN a AND b` | `WHERE year_event BETWEEN 2015 AND 2024;` |
| Pattern match | `LIKE`, `ILIKE` | `WHERE disease_name ILIKE 'influenza%';` |
| Regex (PG) | `~`, `~*` | `WHERE disease_name ~* 'flu\|covid';` |
| Null checks | `IS NULL`, `IS NOT NULL` | `WHERE incidence IS NULL;` |
| String concat | `\|\|` | `SELECT country \|\| ' - ' \|\| disease_name FROM ghs;` |
| Casting | `::type` or `CAST()` | `SELECT incidence::numeric FROM ghs;` |
| Aggregates | `COUNT`/`SUM`/`AVG`/`MIN`/`MAX` | `SELECT SUM(dalys) FROM ghs;` |
| `GROUP BY` | Define groups | `... GROUP BY country, year_event;` |
| `HAVING` | Filter groups | `... GROUP BY country HAVING SUM(dalys) > 0;` |
| JOIN (all) | Combine tables: INNER/LEFT/RIGHT/FULL/CROSS | `SELECT g.country, r.region FROM ghs g LEFT JOIN country_regions r ON r.country = g.country;` |
| Subquery (IN) | Filter by another query | `WHERE country IN (SELECT country FROM focus_list);` |
| CTE (`WITH`) | Name intermediate result | `WITH recent AS (SELECT * FROM ghs WHERE year_event >= 2020) SELECT * FROM recent;` |
| Set ops | `UNION`, `UNION ALL`, `INTERSECT`, `EXCEPT` | `(SELECT country FROM ghs WHERE dalys > 0) INTERSECT (SELECT country FROM ghs WHERE incidence > 0);` |
| Window `OVER(...)` | Rank/totals per partition | `SUM(dalys) OVER (PARTITION BY country);` |
| Ranking | `ROW_NUMBER`, `RANK` | `RANK() OVER (PARTITION BY country ORDER BY SUM(dalys) DESC);` |
| `LAG`/`LEAD` | Prior/next row value | `LEAD(incidence) OVER (PARTITION BY country ORDER BY year_event);` |
| `CASE` | Conditional logic | `CASE WHEN mortality_pct > 5 THEN 'High' ELSE 'Low' END;` |
| `COALESCE` / `NULLIF` | Null handling & safe division | `COALESCE(incidence,0) / NULLIF(population_affected,0);` |
| Strings | `UPPER`/`LOWER`/`TRIM`/`LENGTH`/`REPLACE` | `SELECT UPPER(country), LENGTH(disease_name) FROM ghs;` |
| Dates | `NOW`/`EXTRACT`/`DATE_TRUNC`/`MAKE_DATE` | `DATE_TRUNC('year', MAKE_DATE(year_event,1,1));` |
| DML | `INSERT`/`UPDATE`/`DELETE`/`MERGE` | `UPDATE ghs SET incidence = 0 WHERE incidence IS NULL;` |
| DDL | `CREATE`/`ALTER`/`DROP TABLE` | `ALTER TABLE ghs ADD COLUMN notes TEXT;` |
| Constraints | `PRIMARY KEY`/`UNIQUE`/`CHECK`/`NOT NULL` | `ALTER TABLE ghs ADD CHECK (mortality_pct BETWEEN 0 AND 100);` |
| Indexes | `CREATE INDEX` | `CREATE INDEX ON ghs(country, year_event);` |
| Transactions | `BEGIN`/`COMMIT`/`ROLLBACK`/`SAVEPOINT` | `BEGIN; UPDATE ghs SET dalys = 0; ROLLBACK;` |
| Security | `GRANT`/`REVOKE` | `GRANT SELECT ON ghs TO analyst_role;` |
| Load/Export | `COPY` / `\copy` | `COPY ghs FROM 'C:/temp/file.csv' WITH (CSV, HEADER);` |
| Introspection | `information_schema`, psql `\d` | `SELECT column_name FROM information_schema.columns WHERE table_name = 'global_health_statistics';` |
| Comments | `--`, `/* ... */` | `-- count rows` |
Notes to Keep Handy
- Prefer explicit joins with clear keys; avoid ambiguous `SELECT *` in production code.
- List columns in `COPY` for stable mapping if CSV order changes.
- Use forward slashes on Windows paths with `COPY` (`C:/...`).
- Create composite indexes to match common filters (e.g., `(country, year_event)`).
- Always sanity-check results with quick counts/limits before heavy analysis.
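The sanity-check habit can be scripted in a few lines. A minimal sketch (made-up rows; SQLite via Python for portability, but the same queries run in PostgreSQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (country TEXT, incidence REAL)")
con.executemany("INSERT INTO ghs VALUES (?, ?)",
                [("Australia", 2.0), ("Australia", None), ("France", 5.0)])

# Quick sanity checks before heavier analysis:
total    = con.execute("SELECT COUNT(*) FROM ghs").fetchone()[0]          # rows
distinct = con.execute("SELECT COUNT(DISTINCT country) FROM ghs").fetchone()[0]
nulls    = con.execute(
    "SELECT COUNT(*) FROM ghs WHERE incidence IS NULL").fetchone()[0]     # gaps
preview  = con.execute("SELECT * FROM ghs LIMIT 2").fetchall()            # peek
print(total, distinct, nulls)  # 3 2 1
```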
Part II — Connecting with R, R Markdown, Quarto
Why connect SQL with R (and publish via .Rmd/.qmd)?
Modern analytics benefits from using SQL + R together:
- SQL pushes heavy filtering/aggregation to the database (fast, scalable, auditable).
- R adds wrangling, visualization, statistics, and reproducible documents.
Why use R Markdown (.Rmd) or Quarto (.qmd):
- Reproducibility: prose + code + outputs in one file; anyone can re-run end to end.
- Transparency: every figure or table is tied to the exact SQL/R code that produced it.
- Shareability: render to HTML/PDF/Word for the web or submission.
- Multi-language: mix R chunks and SQL chunks; SQL runs through a live DBI connection created in R.
- Automation: parameterized reports, scheduled renders, version control.
How SQL chunks work in Quarto/R Markdown
1) Create a DBI connection object in an R chunk (e.g., con <- dbConnect(...)).
2) Use SQL chunks with connection = con to run SQL on that database.
3) Outputs (tables) are printed inline under each SQL chunk.
🔐 Security tip: Avoid committing passwords to public repos.
Execution order & prerequisites:
- Install required R packages (DBI, RSQLite, RPostgres, readr, tidyverse, dbplyr).
- Ensure your database server (PostgreSQL) is running when connecting to it.
- Verify file paths (e.g., C:/temp/Global_Health_Statistics.csv) exist on your machine.
- Render top-to-bottom so the connection objects exist before SQL chunks that use them.
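The execution-order point (the connection object must exist before any chunk that queries it) has the same shape in any client library. A minimal connect, query, disconnect sketch, using Python's `sqlite3` purely for illustration:

```python
import sqlite3

# Step 1: create the connection object first (like con <- dbConnect(...) in R).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INT)")
con.execute("INSERT INTO t VALUES (1), (2)")

# Step 2: later "chunks" reuse the same connection to run SQL.
result = con.execute("SELECT SUM(x) FROM t").fetchone()[0]
print(result)  # 3

# Step 3: close the connection when done (like dbDisconnect(con)).
con.close()
```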
R Markdown & Quarto Options (R + SQL)
This section summarizes chunk options and syntax when building reports that mix R chunks and SQL chunks (e.g., `{sql, connection = con}`). The goal is to control execution, display, and reproducibility while running queries directly against a database and printing results inline.
How SQL chunks work (at a glance)
- Open a DBI connection in an R chunk (e.g., con <- dbConnect(...)).
- Run queries in SQL chunks using connection = con.
- Options like echo, eval, include, warning, message, error, and cache behave the same as in R chunks.
- SQL chunk outputs render as tables; you can also capture them into R objects with output.var.
Summary: Quarto/R Markdown Chunk Options (R & SQL)
| Feature | Quarto / knitr Syntax | Description (applies to R & SQL chunks) |
|---|---|---|
| Code chunk (R) | `{r, warning=FALSE, message=FALSE}` | Execute R code; suppress messages/warnings for clean output. |
| Code chunk (SQL) | `{sql, connection = con}` | Execute SQL against a DBI connection object created in an R chunk. |
| Show/hide code | `echo: true/false` | Display the source code (true) or hide it (false). |
| Run/skip code | `eval: true/false` | Execute the chunk or show code without running it. |
| Hide code & output | `include: false` | Run side effects (e.g., temp tables) but don’t print code or results. |
| Control results | `results: 'markup'\|'asis'\|'hide'` | Choose how printed results appear. |
| Messages & warnings | `message: false`, `warning: false` | Suppress messages/warnings in output. |
| Error handling | `error: true/false` | If true, continue rendering and show the error text. |
| Caching | `cache: true` | Reuse results when inputs are unchanged (be careful if DB data changes). |
| Chunk labels | `{r my-label}`, `{sql my-label, connection = con}` | Name chunks for organization, cross-refs, and cache keys. |
| Figures (R) | `fig.width`, `fig.height`, `fig.align` | Size/alignment for plots created in R chunks. |
| Table printing | YAML `df-print: paged` | Global HTML table style; affects SQL outputs too. |
| Output formats | YAML `format:` | Select html, pdf, docx, etc., plus format-specific options. |
SQL–specific conveniences
| SQL Feature | Syntax | What it does |
|---|---|---|
| Bind a connection | `{sql, connection = con}` | Runs the SQL chunk using the DBI connection defined in an R chunk. |
| Capture result | `output.var = 'df'` | Saves the query output into an R object (e.g., df). |
| Show query only | `eval: false` | Display the SQL but do not execute (useful for teaching/drafts). |
| Show result only | `echo: false` | Execute the query and print just the table (hide the SQL text). |
| Hidden setup | `include: false` | Execute without showing anything (e.g., temp objects). |
| Resilient to errors | `error: true` | Don’t halt the build on a chunk error (prints the error). |
Examples
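As a minimal sketch of the mechanics above (the connection name `con`, database name, and table come from this project; the chunk labels are invented), a `.qmd` file might contain:

````markdown
```{r setup, include=FALSE}
library(DBI)
library(RPostgres)
con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "SQL_Data_Manipulation",
  host = "localhost", port = 5432,
  user = "postgres", password = Sys.getenv("PGPASSWORD")
)
```

```{sql preview, connection = con, output.var = "top_rows"}
SELECT country, year_event, dalys
FROM public.global_health_statistics
LIMIT 5;
```
````

Rendering top-to-bottom creates `con` first, prints the query result inline, and also leaves it in the R object `top_rows` for later chunks.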
Abbreviations & acronyms used in this section
- SQL — Structured Query Language
- RDBMS — Relational Database Management System
- ACID — Atomicity, Consistency, Isolation, Durability
- DBaaS — Database as a Service (e.g., RDS/Azure/GCP)
- GUI — Graphical User Interface (e.g., pgAdmin)
- CLI — Command-Line Interface (e.g., psql)
- CTE — Common Table Expression (WITH, WITH RECURSIVE)
- DDL — Data Definition Language (CREATE/ALTER/DROP)
- DML — Data Manipulation Language (SELECT/INSERT/UPDATE/DELETE/MERGE)
- DCL — Data Control Language (GRANT/REVOKE)
- PK — Primary Key
- FK — Foreign Key
- MV — Materialized View
- JSON/JSONB — JavaScript Object Notation (text/binary)
- GIN — Generalized Inverted Index (commonly for jsonb)
- WAL — Write-Ahead Log (durability in Postgres)
- QMD/RMD — Quarto / R Markdown documents
Part III — Data Loading & Exploration
- Table creation (DDL) & data types
- CSV import: `COPY`/`\copy` and permissions
- Basic exploration: previews, counts, distincts, NULL checks
Create the Project Database (with Australian locale)
We’ll call the project SQL_Data_Manipulation.
Note: You created the database with quoted/mixed case. In PostgreSQL, database names are case-sensitive if created with quotes. Always use the exact casing SQL_Data_Manipulation in tools/connectors.
Run this from the postgres database (or any admin DB) in pgAdmin Query Tool:
-- Create database with English_Australia.1252 locale (Windows)
-- If it already exists, skip this block.
CREATE DATABASE "SQL_Data_Manipulation"
WITH
OWNER = postgres
ENCODING = 'UTF8'
LC_COLLATE = 'English_Australia.1252'
LC_CTYPE = 'English_Australia.1252'
LOCALE_PROVIDER = 'libc'
TABLESPACE = pg_default
CONNECTION LIMIT = -1
IS_TEMPLATE = FALSE;
COMMENT ON DATABASE "SQL_Data_Manipulation"
IS 'This is the DB for SQL data manipulation project.';

After creation: in pgAdmin, right-click your server → Refresh, then select the database SQL_Data_Manipulation and open the Query Tool from that database node so subsequent SQL runs in the right place.
💡 Locale note (Windows): English_Australia.1252 controls collation/ctype (sorting/character classification). If creation fails, verify that locale is installed on the OS.
- Expand your server → right-click Databases → Create > Database…
- Database: SQL_Data_Manipulation_Guide → Save.
(Optional) Set this database as the Query Tool target using the dropdown in the toolbar.
🧱 Create the Target Table (in SQL_Data_Manipulation)
-- Ensure you're connected to the "SQL_Data_Manipulation" database before running this.
CREATE TABLE IF NOT EXISTS public.global_health_statistics (
country TEXT,
year_event INT,
disease_name TEXT,
disease_category TEXT,
prevalence NUMERIC,
incidence NUMERIC,
mortality_pct NUMERIC,
age_group TEXT,
gender TEXT,
population_affected BIGINT,
healthcare_access TEXT,
dr_per_1000 NUMERIC,
hospital_beds_per_1000 NUMERIC,
treatment_type TEXT,
average_treatment_cost_usd NUMERIC,
availability_of_vaccines_treatment TEXT,
recovery_rate_pct NUMERIC,
dalys NUMERIC,
improvement_in_5_years_pct NUMERIC,
per_capita_income_usd NUMERIC,
education_index NUMERIC,
urbanization_rate_pct NUMERIC
);

📥 Importing Data — Options and Tips
Importing Data Files into SQL
Before analyzing data in SQL, we need to bring it into a database table.
Unlike R or Python, SQL engines do not read files directly; instead, they rely on database commands (COPY, \copy, BULK INSERT, or external tools) to import files into a table.
Below is a summary of how different file types can be imported into common SQL engines.
| Data Type | SQL Import Method | Database | Example |
|---|---|---|---|
| CSV | `COPY` or `\copy` (Postgres), `BULK INSERT` (SQL Server), `.import` (SQLite) | PostgreSQL, SQLite, SQL Server | `COPY mytable FROM 'file.csv' CSV HEADER;` |
| TSV / Text (.tsv) | Same as CSV but with tab delimiter specified | PostgreSQL, SQLite | `COPY mytable FROM 'file.tsv' DELIMITER E'\t' CSV HEADER;` |
| Excel (.xls/.xlsx) | Convert to CSV first, then import (native SQL cannot parse Excel directly) | All | Save as CSV → `COPY` into SQL |
| ZIP archives | Extract the file first, then import the CSV into SQL | All | `unzip data.zip` → `COPY mytable FROM 'file.csv' CSV HEADER;` |
| Stata/SPSS (.dta/.sav) | Convert to CSV using R/Python/StatTransfer, then load into SQL | All | R: `write.csv(read_dta("file.dta"), "file.csv")` → SQL `COPY` |
| Online files | Download first (with `wget`/`curl`), then import the CSV | All | `wget URL -O file.csv` → SQL `COPY` |
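Whatever the engine, the pattern is the same: create the table, parse the file, bulk-insert the rows. A self-contained sketch with Python's `csv` and `sqlite3` modules (the CSV content is inlined here for illustration; a real import would read from a file path):

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file on disk (COPY/\copy would read a real path instead).
csv_text = "country,year_event,incidence\nAustralia,2021,2.0\nFrance,2021,5.0\n"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (country TEXT, year_event INT, incidence REAL)")

# Skip the header row (as HEADER true does), then bulk-insert the data rows.
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
con.executemany("INSERT INTO ghs VALUES (?, ?, ?)", reader)
con.commit()

n = con.execute("SELECT COUNT(*) FROM ghs").fetchone()[0]
print(n)  # 2
```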
Two ways to load CSV data:
Option A — pgAdmin GUI (Import/Export)
1. Right-click the table → View/Edit Data → All Rows
2. Click Import/Export (📥)
3. Import tab: choose the file, set format CSV, check Header, set delimiter `,`
4. Click OK
Option B — SQL Import
There are two SQL variants:
B1. Server-side COPY (runs on the DB server process)
Works in pgAdmin Query Tool.
The server must be able to read the file path. On Windows, the service account often cannot read Desktop/Documents.
- Fix: move the file to a permissive folder like `C:/temp/`.
COPY public.global_health_statistics (
country, year_event, disease_name, disease_category, prevalence, incidence, mortality_pct,
age_group, gender, population_affected, healthcare_access, dr_per_1000,
hospital_beds_per_1000, treatment_type, average_treatment_cost_usd,
availability_of_vaccines_treatment, recovery_rate_pct, dalys,
improvement_in_5_years_pct, per_capita_income_usd, education_index, urbanization_rate_pct
)
FROM 'C:/temp/Global_Health_Statistics.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"', ESCAPE '"', ENCODING 'UTF8');

B2. Client-side \copy (psql)
Works only in psql (i.e., the SQL Shell), not pgAdmin.
Avoids server-permission issues (can read Desktop paths).
\copy public.global_health_statistics (
country, year_event, disease_name, disease_category, prevalence, incidence, mortality_pct,
age_group, gender, population_affected, healthcare_access, dr_per_1000,
hospital_beds_per_1000, treatment_type, average_treatment_cost_usd,
availability_of_vaccines_treatment, recovery_rate_pct, dalys,
improvement_in_5_years_pct, per_capita_income_usd, education_index, urbanization_rate_pct
)
FROM 'C:/Users/User/Desktop/Materials_ Course and proposals/Course Related/Data_source/Global_Health_Statistics.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"', ESCAPE '"', ENCODING 'UTF8');

Tips (Windows paths)
Prefer forward slashes (C:/…) to avoid escaping backslashes.
If “permission denied” appears with COPY, move the file to C:/temp/ or use client-side \copy instead.
Method used in this project
Used: Option B1 — server-side COPY from C:/temp/Global_Health_Statistics.csv.
Result: COPY 1000000 → 1,000,000 rows imported successfully.
Where to See the Imported Data (Names & Columns)
In pgAdmin: Servers → Databases → SQL_Data_Manipulation → Schemas → public → Tables → global_health_statistics
Right-click → View/Edit Data → All Rows to preview.
Select the table → Columns tab (right pane) to see column names and types.
1) Importing data to R from a local path
This chunk reads the CSV into R, opens a SQLite database file named SQL_Data_Manipulation, and writes the data as a SQLite table. It’s a lightweight, local option for quick iteration (no server required).
# Clear environment and console
rm(list = ls()) # To clear the environment (remove all objects in memory)
cat("\014")   # To clear the console (like wiping a whiteboard)

library(DBI)
library(RSQLite)
library(readr)
library(tidyverse)
library(dbplyr)
# Read CSV
global_health_statistics <- read_csv("C:/temp/Global_Health_Statistics.csv")
# Connect to SQLite
con <- dbConnect(RSQLite::SQLite(), dbname = "SQL_Data_Manipulation")
# Write table to DB
dbWriteTable(con, "global_health_statistics", global_health_statistics, overwrite = TRUE)
# knitr::opts_chunk$set(connection = "con")

Use SQL code from here onwards
The next SQL chunk runs against the current con connection (SQLite here) and previews rows.
SELECT * FROM global_health_statistics LIMIT 10;

| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Italy | 2013 | Malaria | Respiratory | 0.95 | 1.55 | 8.42 | 0-18 | Male | 471007 | 57.74 | 3.34 | 7.58 | Medication | 21064 | No | 91.82 | 4493 | 2.16 | 16886 | 0.79 | 86.02 |
| France | 2002 | Ebola | Parasitic | 12.46 | 8.63 | 8.75 | 61+ | Male | 634318 | 89.21 | 1.33 | 5.11 | Surgery | 47851 | Yes | 76.65 | 2366 | 4.82 | 80639 | 0.74 | 45.52 |
| Turkey | 2015 | COVID-19 | Genetic | 0.91 | 2.35 | 6.22 | 36-60 | Male | 154878 | 56.41 | 4.07 | 3.49 | Vaccination | 27834 | Yes | 98.55 | 41 | 5.81 | 12245 | 0.41 | 40.20 |
| Indonesia | 2011 | Parkinson’s Disease | Autoimmune | 4.68 | 6.29 | 3.99 | 0-18 | Other | 446224 | 85.20 | 3.18 | 8.44 | Surgery | 144 | Yes | 67.35 | 3201 | 2.22 | 49336 | 0.49 | 58.47 |
| Italy | 2013 | Tuberculosis | Genetic | 0.83 | 13.59 | 7.01 | 61+ | Male | 472908 | 67.00 | 4.61 | 5.90 | Medication | 8908 | Yes | 50.06 | 2832 | 6.93 | 47701 | 0.50 | 48.14 |
| Saudi Arabia | 2011 | Dengue | Bacterial | 10.99 | 6.49 | 4.64 | 61+ | Female | 479234 | 98.41 | 3.50 | 0.62 | Therapy | 42671 | Yes | 93.17 | 416 | 9.83 | 29597 | 0.46 | 56.50 |
| USA | 2013 | Malaria | Cardiovascular | 18.42 | 6.33 | 9.33 | 61+ | Female | 289931 | 90.03 | 3.16 | 3.31 | Surgery | 15579 | No | 92.80 | 4535 | 0.89 | 60027 | 0.70 | 20.48 |
| Nigeria | 2007 | Tuberculosis | Neurological | 3.48 | 5.71 | 1.21 | 0-18 | Female | 393296 | 75.60 | 0.54 | 3.54 | Medication | 15744 | Yes | 65.45 | 4584 | 9.81 | 23222 | 0.46 | 66.49 |
| Italy | 2000 | Rabies | Chronic | 15.59 | 4.74 | 6.38 | 19-35 | Female | 253110 | 87.87 | 4.56 | 5.84 | Therapy | 7669 | Yes | 59.23 | 2253 | 9.92 | 30849 | 0.55 | 41.27 |
| Australia | 2006 | Cholera | Chronic | 10.12 | 2.08 | 6.00 | 61+ | Male | 174143 | 95.90 | 4.63 | 6.01 | Medication | 9468 | Yes | 93.21 | 4694 | 2.96 | 68856 | 0.90 | 83.30 |
SELECT
COUNT(*) FILTER (WHERE country IS NULL) AS null_country,
COUNT(*) FILTER (WHERE disease_name IS NULL) AS null_disease
FROM global_health_statistics;

| null_country | null_disease |
|---|---|
| 0 | 0 |
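The NULL-profiling query uses PostgreSQL's `COUNT(*) FILTER (WHERE ...)`; the portable `SUM(CASE ...)` form does the same job in engines without FILTER. A small sketch with invented rows (SQLite via Python):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ghs (country TEXT, disease_name TEXT)")
con.executemany("INSERT INTO ghs VALUES (?, ?)",
                [("Italy", "Malaria"), (None, "Ebola"), ("France", None)])

# Profile NULLs in several columns in a single pass over the table.
row = con.execute("""
    SELECT SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END) AS null_country,
           SUM(CASE WHEN disease_name IS NULL THEN 1 ELSE 0 END) AS null_disease
    FROM ghs
""").fetchone()
print(row)  # (1, 1)
```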
2) Reading the data available in PostgreSQL via R
This section opens a PostgreSQL connection using RPostgres to the database “SQL_Data_Manipulation”. Best practice is to keep credentials out of code (use env vars or keychain).
# install.packages(c("DBI","RPostgres","dplyr")) # run once
library(DBI)
library(RPostgres)
library(dplyr)
# (Optional) supply credentials via environment variables; in real use, set
# these outside the script (e.g., in .Renviron) rather than hard-coding them.
Sys.setenv(PGUSER = "postgres", PGPASSWORD = "yourpassword")
con <- dbConnect(
RPostgres::Postgres(),
host = "localhost",
port = 5432,
dbname = "SQL_Data_Manipulation", # matches your screenshot
user = Sys.getenv("PGUSER", "postgres"),
password = Sys.getenv("PGPASSWORD", "yourpassword")
)

Pull the table from PostgreSQL into an R data frame (or use dbGetQuery to run a SQL SELECT).
global_health_statistics <- dbReadTable(con, Id(schema = "public", table = "global_health_statistics"))
# or: ghs <- dbGetQuery(con, "SELECT * FROM public.global_health_statistics;")

Reading and Importing Data
In SQL, importing means loading external files into a database table so they can be queried.
Unlike R, where we use functions like read_csv() or read_excel(), in SQL we rely on commands such as COPY, LOAD DATA INFILE, or BULK INSERT depending on the database system.
📊 Reading Common Data Types in SQL
| Data Type | SQL Command / Method | Notes | Example |
|---|---|---|---|
| CSV | COPY (Postgres), LOAD DATA INFILE (MySQL), BULK INSERT (SQL Server) | Most common tabular format | COPY patients FROM '/path/patients.csv' CSV HEADER; |
| Excel (.xls/.xlsx) | Not directly supported | Convert to CSV or import via ETL tools | |
| Tab-separated (.tsv) | Same as CSV with delimiter \t | | COPY patients FROM '/path/file.tsv' DELIMITER E'\t' CSV HEADER; |
| Text (.txt) | Load as delimited file | Need to define delimiter/schema | |
| Stata (.dta), SPSS (.sav), RDS | Not supported natively | Convert to CSV first (R/Python/SAS) | |
| ZIP archives | SQL cannot unzip | Unzip externally, then import CSV | |
| Online CSV/Excel | Some cloud DBs (BigQuery, Snowflake) allow URL loading | Otherwise: download first | LOAD DATA ... FROM 'https://...' |
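For PostgreSQL specifically, the COPY/\copy pair from the table above is worth seeing in full. A sketch (the file paths are hypothetical; the target table must already exist with matching columns):

```sql
-- Server-side COPY: the file path is read by the PostgreSQL server process,
-- so it requires superuser rights or pg_read_server_files membership.
COPY public.global_health_statistics
FROM '/srv/data/global_health_statistics.csv'
WITH (FORMAT csv, HEADER true);

-- Client-side \copy (a psql meta-command): the file is read by the client,
-- so ordinary INSERT privileges on the table are enough.
-- \copy public.global_health_statistics FROM 'C:/data/global_health_statistics.csv' CSV HEADER
```

The distinction matters on Windows installs, where the server process typically cannot read files from a user's Documents folder.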
From here onward, the examples run SQL queries directly against PostgreSQL.
This demonstration shows how I use SQL to explore and validate a large dataset on global health statistics. Each section introduces the query with a short explanation, followed by the SQL code and notes about the expected output.
Data Exploration
Before analysis, it is important to preview the dataset to confirm structure and column names.
The following query selects the first 10 rows, showing every variable: country, year_event, disease_name, prevalence, incidence, mortality_pct, and the socio-economic indicators.
-- check some Sample rows
SELECT *
FROM public.global_health_statistics
LIMIT 10;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| China | 2015 | Hepatitis | Genetic | 19.87 | 3.07 | 8.54 | 0-18 | Male | 979238 | 71.63 | 1.71 | 1.16 | Therapy | 33951 | Yes | 51.60 | 980 | 1.62 | 74848 | 0.84 | 51.33 |
| Australia | 2003 | Asthma | Bacterial | 16.47 | 5.11 | 7.73 | 0-18 | Female | 983250 | 58.48 | 4.22 | 1.08 | Therapy | 18945 | Yes | 74.70 | 1794 | 7.15 | 59994 | 0.54 | 44.09 |
| Germany | 2003 | Dengue | Neurological | 0.50 | 3.89 | 9.19 | 19-35 | Male | 995467 | 53.86 | 2.98 | 8.98 | Therapy | 16993 | No | 69.58 | 2569 | 0.05 | 16782 | 0.40 | 52.60 |
| China | 2002 | Malaria | Autoimmune | 18.38 | 3.81 | 6.77 | 19-35 | Female | 990949 | 62.36 | 1.66 | 9.68 | Therapy | 17764 | Yes | 97.23 | 3047 | 5.84 | 89944 | 0.81 | 21.99 |
| Russia | 2007 | COVID-19 | Bacterial | 15.03 | 12.88 | 3.54 | 61+ | Other | 996438 | 88.06 | 4.02 | 2.22 | Vaccination | 29279 | No | 56.19 | 4102 | 9.71 | 43114 | 0.49 | 57.31 |
| Russia | 2012 | Polio | Bacterial | 16.17 | 9.98 | 3.28 | 19-35 | Other | 998885 | 80.62 | 3.92 | 3.00 | Vaccination | 34751 | Yes | 89.90 | 1876 | 8.80 | 40763 | 0.45 | 54.51 |
| South Africa | 2000 | Hypertension | Neurological | 3.53 | 5.05 | 5.89 | 19-35 | Male | 977922 | 62.92 | 3.59 | 9.15 | Therapy | 44512 | Yes | 92.83 | 1045 | 5.52 | 57064 | 0.50 | 50.14 |
| South Korea | 2009 | Cholera | Bacterial | 17.03 | 14.88 | 5.54 | 36-60 | Male | 993314 | 96.23 | 2.55 | 6.10 | Therapy | 25692 | Yes | 73.43 | 180 | 7.13 | 60687 | 0.53 | 30.15 |
| Turkey | 2017 | Leprosy | Bacterial | 18.75 | 8.30 | 1.51 | 61+ | Other | 972077 | 75.82 | 4.16 | 8.22 | Medication | 2330 | Yes | 58.85 | 1940 | 8.76 | 96878 | 0.68 | 62.72 |
| Indonesia | 2019 | Dengue | Respiratory | 15.21 | 3.26 | 9.92 | 61+ | Male | 991463 | 69.99 | 1.58 | 1.45 | Therapy | 23547 | Yes | 50.35 | 994 | 2.74 | 71648 | 0.60 | 32.33 |
Confirm how many rows were loaded into the table. Here the query returns 10,000 rows.
-- Total rows loaded
SELECT COUNT(*) AS rows_loaded
FROM public.global_health_statistics;
| rows_loaded |
|---|
| 10000 |
Check the earliest and latest years in the dataset: spans from 2000 to 2024.
SELECT MIN(year_event) AS min_year, MAX(year_event) AS max_year
FROM public.global_health_statistics;
| min_year | max_year |
|---|---|
| 2000 | 2024 |
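Distinct counts round out the basic exploration checklist (previews, counts, distincts, NULL checks). A sketch against the same table; the counts depend on the loaded data:

```sql
-- How many distinct countries, diseases, and years appear?
SELECT
  COUNT(DISTINCT country)      AS n_countries,
  COUNT(DISTINCT disease_name) AS n_diseases,
  COUNT(DISTINCT year_event)   AS n_years
FROM public.global_health_statistics;
```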
Look at the most recent events by sorting years in descending order:
SELECT country, year_event, incidence
FROM public.global_health_statistics
ORDER BY year_event DESC
LIMIT 10;
| country | year_event | incidence |
|---|---|---|
| Indonesia | 2024 | 9.01 |
| Germany | 2024 | 11.98 |
| France | 2024 | 12.53 |
| Canada | 2024 | 10.95 |
| Nigeria | 2024 | 5.49 |
| Brazil | 2024 | 7.82 |
| Russia | 2024 | 4.51 |
| Japan | 2024 | 3.38 |
| UK | 2024 | 10.58 |
| Saudi Arabia | 2024 | 4.10 |
Checking Data Completeness
Check for missing values at the PostgreSQL level: Both null_country and null_disease return 0.
SELECT
COUNT(*) FILTER (WHERE country IS NULL) AS null_country,
COUNT(*) FILTER (WHERE disease_name IS NULL) AS null_disease
FROM public.global_health_statistics;
| null_country | null_disease |
|---|---|
| 0 | 0 |
Creating a Basic Data Dictionary
-- PostgreSQL example: auto-generate a simple data dictionary
SELECT
table_name,
column_name,
data_type,
is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
AND table_name = 'patients';
| table_name | column_name | data_type | is_nullable |
|---|---|---|---|
(The result is empty here because no patients table exists in this demo database.)
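Pointed at the demo table instead of the patients placeholder, the same metadata query documents every column; ordering by ordinal_position keeps the dictionary in table order:

```sql
-- Data dictionary for the dataset used in this presentation
SELECT
  table_name,
  column_name,
  data_type,
  is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'global_health_statistics'
ORDER BY ordinal_position;
```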
📊 Summary Table: SQL Equivalents
The table below provides SQL equivalents for common R exploration tasks.
It shows the task, the general SQL approach, and an example query.
| Task | SQL Approach | Example |
|---|---|---|
| View structure of dataset | Query metadata | SELECT column_name, data_type FROM information_schema.columns |
| Summarize statistics | Aggregate functions | MIN(), MAX(), AVG(), STDDEV() |
| Preview first rows | LIMIT | SELECT * FROM patients LIMIT 10; |
| Preview last rows | ORDER BY ... DESC LIMIT n | SELECT * FROM patients ORDER BY id DESC LIMIT 8; |
| Count frequencies | GROUP BY + COUNT() | SELECT gender, COUNT(*) FROM patients GROUP BY gender; |
| Count rows | COUNT(*) | SELECT COUNT(*) FROM patients; |
| Count columns | Metadata query | SELECT COUNT(*) FROM information_schema.columns; |
| List variable names | Metadata query | SELECT column_name FROM information_schema.columns; |
| Check variable type | Metadata query | SELECT column_name, data_type FROM information_schema.columns; |
| Create data dictionary | Metadata query | (see metadata query example above) |
| Clean column/text values | SQL functions | UPPER(TRIM(col)) |
Part IV — Wrangling Core (SELECT / WHERE / CASE / JOIN)
Once a dataset has been imported and explored, the next step is wrangling — reshaping and preparing the data for analysis.
In SQL, the core tools for wrangling are:
- Selecting & aliasing (choosing columns, renaming for clarity, creating computed columns)
- Filtering, sorting, limiting (subsetting rows)
- Conditional logic (CASE, COALESCE, NULLIF)
- Joins (combining data across multiple tables)
📊 Summary Table — Conditional Logic & Joins
The table below summarizes key SQL wrangling patterns using the global_health_statistics dataset.
It highlights creating variables with CASE, handling missing values, and common join strategies, with representative examples.
| Concept | Description | Example SQL |
|---|---|---|
| Binary indicator with CASE | Recode into two groups (like ifelse() in R). | sql<br>SELECT disease_name, mortality_pct,<br> CASE WHEN mortality_pct > 20 THEN 'High Mortality'<br> ELSE 'Low/Moderate Mortality' END AS mortality_group<br>FROM public.global_health_statistics<br>LIMIT 10; |
| Multi-category recoding | Multiple conditions, like case_when() in R. | sql<br>SELECT disease_name, mortality_pct,<br> CASE WHEN mortality_pct < 5 THEN 'Low'<br> WHEN mortality_pct BETWEEN 5 AND 20 THEN 'Moderate'<br> WHEN mortality_pct > 20 THEN 'High'<br> ELSE 'Unknown' END AS mortality_category<br>FROM public.global_health_statistics<br>LIMIT 10; |
| Buckets based on numeric ranges | Group continuous values into buckets. | sql<br>SELECT country, disease_name, year_event, population_affected,<br> CASE WHEN population_affected IS NULL THEN 'unknown'<br> WHEN population_affected < 1_000 THEN 'very low'<br> WHEN population_affected < 10_000 THEN 'low'<br> WHEN population_affected < 100_000 THEN 'moderate'<br> WHEN population_affected < 1_000_000 THEN 'high'<br> ELSE 'very high' END AS burden_bucket<br>FROM public.global_health_statistics<br>ORDER BY year_event DESC, population_affected DESC<br>LIMIT 10; |
| Combine multiple variables | Conditional logic across >1 column. | sql<br>SELECT disease_name, age_group,<br> CASE WHEN disease_category = 'Infectious' AND age_group = '0-14' THEN 'Child Infectious'<br> WHEN disease_category = 'Infectious' AND age_group = '65+' THEN 'Elderly Infectious'<br> WHEN disease_category = 'Non-communicable' AND age_group = '15-64' THEN 'Adult NCD'<br> ELSE 'Other' END AS combined_category<br>FROM public.global_health_statistics<br>LIMIT 10; |
| Handle missing values | COALESCE replaces NULL, NULLIF avoids divide-by-zero. | sql<br>SELECT country, COALESCE(healthcare_access,'Unknown') AS healthcare_access<br>FROM public.global_health_statistics<br>LIMIT 10; |
| LEFT JOIN | Keep all fact rows, enrich where possible. | sql<br>SELECT g.country, COALESCE(cr.region,'Unknown') AS region,<br> g.disease_name, g.year_event, g.population_affected<br>FROM public.global_health_statistics g<br>LEFT JOIN country_region cr ON cr.country = g.country<br>WHERE g.year_event BETWEEN 2018 AND 2022<br>ORDER BY g.year_event DESC, region, g.population_affected DESC<br>LIMIT 15; |
| FULL JOIN | Keep all rows from both tables; useful for reconciliation. | sql<br>SELECT COALESCE(g.country, cr.country) AS country_key,<br> cr.region, g.disease_name, g.year_event, g.population_affected,<br> CASE WHEN g.country IS NULL THEN 'missing_in_facts'<br> WHEN cr.country IS NULL THEN 'missing_in_dim'<br> ELSE 'matched' END AS join_status<br>FROM public.global_health_statistics g<br>FULL OUTER JOIN country_region cr ON cr.country = g.country<br>LIMIT 10; |
| Anti-join | Find fact rows with no matching dimension. | sql<br>SELECT g.*<br>FROM public.global_health_statistics g<br>WHERE NOT EXISTS (SELECT 1 FROM country_region cr WHERE cr.country = g.country)<br>LIMIT 10; |
| Semi-join | Keep fact rows that have a match, without returning dimension columns. | sql<br>SELECT g.country, g.disease_name, g.year_event, g.population_affected<br>FROM public.global_health_statistics g<br>WHERE EXISTS (SELECT 1 FROM country_region cr WHERE cr.country = g.country)<br>LIMIT 10; |
For demonstration, I will continue using the global_health_statistics dataset.
1. Selecting & Aliasing
Aliasing allows us to rename columns or create new computed fields directly in queries.
Here, I renamed columns (country → country_name) and created a computed column to express affected population in millions. This makes query results easier to interpret when preparing reports.
-- Select key columns with aliases and a computed column
SELECT
country AS country_name,
disease_name AS disease,
year_event AS year,
population_affected,
ROUND(CAST(population_affected AS numeric) / 1000000, 2) AS population_millions
FROM public.global_health_statistics
LIMIT 10;
| country_name | disease | year | population_affected | population_millions |
|---|---|---|---|---|
| Saudi Arabia | Parkinson’s Disease | 2005 | 994863 | 0.99 |
| Australia | Measles | 2003 | 967319 | 0.97 |
| Italy | Ebola | 2003 | 983403 | 0.98 |
| India | Ebola | 2016 | 988479 | 0.99 |
| Turkey | Hepatitis | 2001 | 995574 | 1.00 |
| Australia | Asthma | 2007 | 997674 | 1.00 |
| Saudi Arabia | Zika | 2021 | 974805 | 0.97 |
| USA | Cholera | 2020 | 997379 | 1.00 |
| Saudi Arabia | HIV/AIDS | 2012 | 984721 | 0.98 |
| UK | COVID-19 | 2023 | 999050 | 1.00 |
2. Filtering, Sorting, Limiting
Filtering rows focuses on subsets of interest. Sorting orders results, and limiting restricts the number of records returned.
The code below extracts the countries most affected by Tuberculosis after 2015, ranked by population impacted. Filtering with WHERE, ordering with ORDER BY, and subsetting with LIMIT form the core of SQL wrangling.
-- Top 10 most affected records for Tuberculosis after 2015
SELECT
country,
year_event,
disease_name,
population_affected
FROM public.global_health_statistics
WHERE disease_name = 'Tuberculosis'
AND year_event > 2015
ORDER BY population_affected DESC
LIMIT 10;
| country | year_event | disease_name | population_affected |
|---|---|---|---|
| USA | 2021 | Tuberculosis | 999879 |
| South Korea | 2022 | Tuberculosis | 999833 |
| UK | 2019 | Tuberculosis | 999818 |
| South Africa | 2018 | Tuberculosis | 999784 |
| Indonesia | 2018 | Tuberculosis | 999769 |
| South Korea | 2019 | Tuberculosis | 999739 |
| South Korea | 2024 | Tuberculosis | 999711 |
| India | 2016 | Tuberculosis | 999603 |
| India | 2023 | Tuberculosis | 999487 |
| Italy | 2022 | Tuberculosis | 999477 |
While iterating on a query, prefer explicit predicates, deterministic ordering, and tight limits.
-- explicit predicates, deterministic ordering, and tight limits while iterating
SELECT country, disease_name, year_event, population_affected
FROM public.global_health_statistics
WHERE year_event BETWEEN 2010 AND 2020 -- time window
AND disease_name ILIKE '%influenza%' -- case-insensitive contains
AND population_affected IS NOT NULL -- avoid surprises in math
ORDER BY year_event DESC, population_affected DESC -- deterministic, business-first order
LIMIT 10;
| country | disease_name | year_event | population_affected |
|---|---|---|---|
| Brazil | Influenza | 2020 | 999885 |
| UK | Influenza | 2020 | 999365 |
| South Korea | Influenza | 2020 | 999300 |
| China | Influenza | 2020 | 998855 |
| Nigeria | Influenza | 2020 | 996164 |
| France | Influenza | 2020 | 995307 |
| Italy | Influenza | 2020 | 994765 |
| Indonesia | Influenza | 2020 | 994646 |
| Germany | Influenza | 2020 | 994263 |
| Australia | Influenza | 2020 | 992359 |
Filtering by specific letters, words, or names
Filtering with LIKE and ILIKE
Filtering text values is a common requirement in health datasets.
Often we need to retrieve records where a disease name, country, or other text field contains, starts with, or ends with a certain word.
In SQL, this is achieved using the LIKE operator; in PostgreSQL, ILIKE is the case-insensitive variant.
- % matches any sequence of characters (including none).
- _ matches a single character.
Below is a summary table of common filtering patterns:
| Pattern | Description | SQL Example |
|---|---|---|
| %Alzheimer% | Matches any value containing “Alzheimer” | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '%Alzheimer%'; |
| Alzheimer% | Matches values that start with “Alzheimer” | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE 'Alzheimer%'; |
| %Alzheimer | Matches values that end with “Alzheimer” | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '%Alzheimer'; |
| %flu% | Matches any value containing “flu” (e.g. “Influenza”, “Avian flu”) | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '%flu%'; |
| _alaria | Matches a 7-letter word ending in “alaria” (e.g. “Malaria”) | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE '_alaria'; |
| C% | Matches any value starting with the letter “C” | SELECT * FROM public.global_health_statistics WHERE disease_name LIKE 'C%'; |
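Because the wildcard rules are easy to misremember, here is a self-contained check that runs without the demo table, using an inline VALUES list (the three disease names are chosen for illustration):

```sql
-- %: any sequence of characters; _: exactly one character
SELECT d,
       d LIKE '%flu%'   AS contains_flu,    -- true only for 'Influenza' here (LIKE is case-sensitive)
       d LIKE '_alaria' AS one_char_prefix  -- true only for the 7-letter 'Malaria'
FROM (VALUES ('Influenza'), ('Malaria'), ('Filariasis')) AS t(d);
```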
-- Find all records where the disease name starts with “Influenza”
SELECT *
FROM public.global_health_statistics
WHERE disease_name LIKE 'Influenza%';
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| South Korea | 2021 | Influenza | Autoimmune | 16.40 | 9.03 | 8.06 | 0-18 | Female | 991918 | 96.11 | 2.54 | 7.47 | Medication | 24235 | Yes | 84.39 | 1895 | 5.32 | 83332 | 0.62 | 41.45 |
| Russia | 2024 | Influenza | Neurological | 9.23 | 11.75 | 7.85 | 36-60 | Male | 995982 | 93.26 | 4.22 | 6.31 | Vaccination | 8893 | No | 79.07 | 2119 | 2.34 | 93844 | 0.66 | 74.08 |
| Mexico | 2007 | Influenza | Autoimmune | 14.52 | 14.32 | 9.96 | 0-18 | Female | 985023 | 89.24 | 3.35 | 3.56 | Surgery | 17797 | Yes | 67.76 | 1859 | 3.29 | 45837 | 0.63 | 31.86 |
| Nigeria | 2016 | Influenza | Neurological | 4.46 | 0.88 | 4.07 | 61+ | Male | 971438 | 99.91 | 1.35 | 5.23 | Medication | 16069 | Yes | 54.47 | 4392 | 3.98 | 42556 | 0.83 | 70.52 |
| Russia | 2005 | Influenza | Chronic | 11.17 | 5.11 | 4.62 | 61+ | Other | 978667 | 63.33 | 3.12 | 0.79 | Vaccination | 39282 | Yes | 50.16 | 1032 | 0.71 | 19698 | 0.78 | 37.30 |
| Saudi Arabia | 2016 | Influenza | Respiratory | 15.43 | 0.67 | 3.53 | 36-60 | Female | 993985 | 61.94 | 4.18 | 3.19 | Vaccination | 20822 | No | 68.19 | 3037 | 6.43 | 40410 | 0.65 | 75.44 |
| Australia | 2005 | Influenza | Autoimmune | 14.17 | 6.03 | 3.00 | 36-60 | Other | 998326 | 67.87 | 4.30 | 8.71 | Medication | 41420 | Yes | 73.73 | 3639 | 9.46 | 31151 | 0.52 | 56.19 |
| Germany | 2008 | Influenza | Genetic | 19.75 | 0.87 | 1.31 | 19-35 | Other | 976737 | 52.15 | 1.74 | 0.63 | Medication | 27479 | No | 84.86 | 2169 | 9.38 | 21133 | 0.89 | 80.84 |
| France | 2014 | Influenza | Infectious | 5.38 | 13.02 | 4.03 | 61+ | Female | 966348 | 52.44 | 4.15 | 9.89 | Surgery | 18901 | Yes | 75.84 | 990 | 0.95 | 81729 | 0.65 | 39.77 |
| Brazil | 2013 | Influenza | Metabolic | 13.00 | 3.53 | 8.28 | 36-60 | Female | 963595 | 61.27 | 2.40 | 4.45 | Vaccination | 36551 | No | 82.40 | 2619 | 1.95 | 47881 | 0.63 | 74.25 |
Use ILIKE instead of LIKE to make the search case-insensitive:
-- Case-insensitive match for "flu"
SELECT *
FROM public.global_health_statistics
WHERE disease_name ILIKE '%flu%';
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| France | 2000 | Influenza | Neurological | 10.98 | 12.50 | 6.16 | 19-35 | Male | 972289 | 90.77 | 0.81 | 8.97 | Therapy | 20977 | Yes | 68.52 | 240 | 7.06 | 50477 | 0.52 | 54.33 |
| Saudi Arabia | 2012 | Influenza | Respiratory | 1.42 | 3.61 | 6.36 | 61+ | Other | 996854 | 64.29 | 2.52 | 7.32 | Therapy | 27310 | Yes | 88.74 | 4757 | 0.45 | 18559 | 0.87 | 20.21 |
| UK | 2004 | Influenza | Cardiovascular | 5.65 | 7.87 | 0.45 | 61+ | Male | 982164 | 50.52 | 4.30 | 6.90 | Vaccination | 357 | Yes | 50.72 | 2576 | 8.51 | 96145 | 0.72 | 45.57 |
| South Africa | 2003 | Influenza | Genetic | 18.19 | 0.77 | 1.28 | 19-35 | Male | 984296 | 78.75 | 4.08 | 3.62 | Vaccination | 36018 | Yes | 94.75 | 1958 | 4.49 | 24740 | 0.69 | 45.43 |
| Saudi Arabia | 2011 | Influenza | Neurological | 18.44 | 7.17 | 8.03 | 36-60 | Female | 991688 | 75.60 | 4.46 | 2.18 | Therapy | 5946 | No | 63.11 | 2149 | 7.49 | 60796 | 0.51 | 49.90 |
| Turkey | 2019 | Influenza | Bacterial | 17.33 | 1.57 | 8.76 | 0-18 | Female | 990389 | 69.89 | 4.80 | 9.46 | Therapy | 5303 | Yes | 62.03 | 709 | 1.20 | 54889 | 0.49 | 29.12 |
| Italy | 2014 | Influenza | Genetic | 0.58 | 11.24 | 2.86 | 61+ | Male | 971317 | 91.63 | 2.32 | 8.15 | Vaccination | 48912 | Yes | 85.41 | 3653 | 8.92 | 53230 | 0.49 | 40.62 |
| Germany | 2024 | Influenza | Chronic | 15.96 | 2.96 | 4.53 | 0-18 | Female | 994346 | 64.58 | 4.77 | 7.13 | Vaccination | 43370 | No | 54.10 | 4054 | 6.88 | 72122 | 0.47 | 26.04 |
| Australia | 2016 | Influenza | Parasitic | 1.23 | 0.33 | 7.45 | 36-60 | Other | 996915 | 55.93 | 2.55 | 2.72 | Therapy | 42703 | No | 60.02 | 336 | 4.81 | 28099 | 0.66 | 85.06 |
| China | 2008 | Influenza | Parasitic | 12.37 | 2.22 | 4.60 | 19-35 | Male | 992246 | 58.01 | 4.08 | 4.98 | Vaccination | 4833 | No | 98.34 | 117 | 2.56 | 53237 | 0.59 | 78.22 |
-- Find all records where the disease name contains "Alzheimer"
SELECT *
FROM public.global_health_statistics
WHERE disease_name LIKE '%Alzheimer%';
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| India | 2022 | Alzheimer’s Disease | Neurological | 1.39 | 6.56 | 9.98 | 19-35 | Female | 993771 | 89.40 | 3.56 | 4.85 | Therapy | 34353 | No | 65.82 | 3919 | 3.99 | 46885 | 0.63 | 33.94 |
| Nigeria | 2002 | Alzheimer’s Disease | Viral | 15.92 | 12.36 | 6.57 | 19-35 | Male | 982502 | 76.19 | 2.47 | 6.80 | Surgery | 25648 | Yes | 98.90 | 1757 | 8.64 | 89792 | 0.79 | 46.24 |
| Turkey | 2017 | Alzheimer’s Disease | Respiratory | 15.07 | 14.96 | 8.67 | 0-18 | Female | 993217 | 79.87 | 2.17 | 2.43 | Therapy | 23187 | No | 71.03 | 2156 | 8.09 | 28630 | 0.56 | 83.17 |
| Germany | 2023 | Alzheimer’s Disease | Viral | 17.64 | 3.77 | 5.58 | 19-35 | Other | 982340 | 81.97 | 2.35 | 8.19 | Therapy | 45121 | No | 65.34 | 1861 | 3.76 | 74511 | 0.65 | 62.16 |
| Japan | 2020 | Alzheimer’s Disease | Autoimmune | 7.30 | 6.18 | 4.46 | 36-60 | Female | 994553 | 53.65 | 2.31 | 4.98 | Surgery | 36987 | No | 79.00 | 2286 | 2.89 | 80599 | 0.71 | 34.09 |
| Japan | 2015 | Alzheimer’s Disease | Viral | 14.21 | 8.87 | 8.83 | 61+ | Other | 987549 | 87.25 | 2.45 | 7.87 | Vaccination | 2382 | No | 95.62 | 1509 | 3.66 | 41121 | 0.82 | 68.78 |
| Saudi Arabia | 2005 | Alzheimer’s Disease | Genetic | 18.55 | 11.63 | 7.36 | 19-35 | Other | 989098 | 81.79 | 2.71 | 6.32 | Surgery | 38087 | Yes | 81.21 | 420 | 6.04 | 67373 | 0.74 | 32.60 |
| USA | 2012 | Alzheimer’s Disease | Parasitic | 15.02 | 0.81 | 6.39 | 0-18 | Male | 978985 | 82.86 | 3.58 | 2.62 | Therapy | 5799 | Yes | 71.14 | 1636 | 9.24 | 14485 | 0.63 | 33.76 |
| Brazil | 2014 | Alzheimer’s Disease | Bacterial | 9.72 | 0.82 | 4.23 | 19-35 | Other | 999386 | 99.29 | 3.65 | 1.96 | Vaccination | 32450 | No | 72.86 | 2464 | 4.40 | 18578 | 0.51 | 57.38 |
| Indonesia | 2011 | Alzheimer’s Disease | Respiratory | 14.59 | 1.88 | 1.37 | 0-18 | Male | 995566 | 63.94 | 3.33 | 5.84 | Surgery | 17167 | Yes | 70.50 | 1986 | 6.51 | 27794 | 0.89 | 73.16 |
Top-N per group (practical)
Top 3 countries per year by affected population.
WITH ranked AS (
SELECT
country,
year_event,
population_affected,
ROW_NUMBER() OVER (
PARTITION BY year_event
ORDER BY population_affected DESC, country ASC
) AS rn
FROM public.global_health_statistics
WHERE population_affected IS NOT NULL
)
SELECT country, year_event, population_affected
FROM ranked
WHERE rn <= 3
ORDER BY year_event, rn, country;
| country | year_event | population_affected |
|---|---|---|
| Australia | 2000 | 999987 |
| Brazil | 2000 | 999982 |
| Canada | 2000 | 999970 |
| China | 2001 | 999989 |
| Italy | 2001 | 999986 |
| Russia | 2001 | 999979 |
| UK | 2002 | 999999 |
| France | 2002 | 999983 |
| China | 2002 | 999963 |
| Japan | 2003 | 999999 |
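ROW_NUMBER() breaks ties arbitrarily (here the tie-breaker is country ASC). If tied populations should share a rank, and possibly return more than three rows per year, DENSE_RANK() is the usual substitute. A sketch:

```sql
-- Top-N per group with ties preserved: equal populations share a rank
WITH ranked AS (
  SELECT
    country,
    year_event,
    population_affected,
    DENSE_RANK() OVER (
      PARTITION BY year_event
      ORDER BY population_affected DESC
    ) AS rnk
  FROM public.global_health_statistics
  WHERE population_affected IS NOT NULL
)
SELECT country, year_event, population_affected
FROM ranked
WHERE rnk <= 3
ORDER BY year_event, rnk, country;
```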
3. Creating Variables and Conditional Logic
SQL lets us recode or transform values inline.
Using CASE
CASE is the SQL equivalent of ifelse() in R or if-else in Python, and it’s essential for creating categorical groupings.
-- Create a binary indicator for high mortality diseases
SELECT
disease_name,
mortality_pct,
CASE
WHEN mortality_pct > 20 THEN 'High Mortality'
ELSE 'Low/Moderate Mortality'
END AS mortality_group
FROM public.global_health_statistics
LIMIT 10;
| disease_name | mortality_pct | mortality_group |
|---|---|---|
| Cancer | 4.58 | Low/Moderate Mortality |
| Hepatitis | 3.00 | Low/Moderate Mortality |
| Tuberculosis | 7.15 | Low/Moderate Mortality |
| HIV/AIDS | 6.04 | Low/Moderate Mortality |
| Hypertension | 3.42 | Low/Moderate Mortality |
| HIV/AIDS | 1.83 | Low/Moderate Mortality |
| Parkinson’s Disease | 1.04 | Low/Moderate Mortality |
| Diabetes | 7.10 | Low/Moderate Mortality |
| Cancer | 0.55 | Low/Moderate Mortality |
| Cholera | 6.68 | Low/Moderate Mortality |
Multi-Category Recoding
Suppose we want to categorize diseases into Low, Moderate, and High mortality risk groups based on the mortality_pct variable, using a CASE statement with multiple WHEN ... THEN ... clauses.
-- Multi-category classification of mortality levels
SELECT
disease_name,
mortality_pct,
CASE
WHEN mortality_pct < 5 THEN 'Low'
WHEN mortality_pct BETWEEN 5 AND 20 THEN 'Moderate'
WHEN mortality_pct > 20 THEN 'High'
ELSE 'Unknown'
END AS mortality_category
FROM public.global_health_statistics
LIMIT 10;
| disease_name | mortality_pct | mortality_category |
|---|---|---|
| Zika | 3.08 | Low |
| Cholera | 2.69 | Low |
| HIV/AIDS | 0.22 | Low |
| Asthma | 8.27 | Moderate |
| Alzheimer’s Disease | 3.80 | Low |
| COVID-19 | 0.89 | Low |
| Tuberculosis | 3.45 | Low |
| HIV/AIDS | 4.25 | Low |
| Leprosy | 4.55 | Low |
| Diabetes | 1.42 | Low |
-- Bucket affected population into labeled ranges
-- (note: underscore digit separators such as 1_000_000 require PostgreSQL 16+)
SELECT
country,
disease_name,
year_event,
population_affected,
CASE
WHEN population_affected IS NULL THEN 'unknown'
WHEN population_affected < 1_000 THEN 'very low'
WHEN population_affected < 10_000 THEN 'low'
WHEN population_affected < 100_000 THEN 'moderate'
WHEN population_affected < 1_000_000 THEN 'high'
ELSE 'very high'
END AS burden_bucket
FROM public.global_health_statistics
ORDER BY year_event DESC, population_affected DESC
LIMIT 10;
| country | disease_name | year_event | population_affected | burden_bucket |
|---|---|---|---|---|
| Argentina | Measles | 2024 | 999962 | high |
| South Korea | Polio | 2024 | 999921 | high |
| Turkey | Influenza | 2024 | 999910 | high |
| India | Dengue | 2024 | 999855 | high |
| India | Polio | 2024 | 999848 | high |
| South Korea | Asthma | 2024 | 999812 | high |
| Russia | Polio | 2024 | 999791 | high |
| South Africa | Cancer | 2024 | 999754 | high |
| Italy | Diabetes | 2024 | 999751 | high |
| South Korea | Tuberculosis | 2024 | 999711 | high |
Combining Multiple Variables
We can even combine conditions from more than one variable. Here, we create a category based on both disease type and age group.
-- Conditional grouping using disease and age group
SELECT
disease_name,
age_group,
CASE
WHEN disease_category = 'Infectious' AND age_group = '0-14' THEN 'Child Infectious'
WHEN disease_category = 'Infectious' AND age_group = '65+' THEN 'Elderly Infectious'
WHEN disease_category = 'Non-communicable' AND age_group = '15-64' THEN 'Adult NCD'
ELSE 'Other'
END AS combined_category
FROM public.global_health_statistics
LIMIT 10;
| disease_name | age_group | combined_category |
|---|---|---|
| Zika | 36-60 | Other |
| Malaria | 36-60 | Other |
| Measles | 0-18 | Other |
| Leprosy | 36-60 | Other |
| Malaria | 0-18 | Other |
| Influenza | 61+ | Other |
| Diabetes | 36-60 | Other |
| Polio | 0-18 | Other |
| Zika | 36-60 | Other |
| Leprosy | 61+ | Other |
Handling Missing Values (COALESCE, NULLIF)
COALESCE replaces missing values with a default. NULLIF prevents division by zero by returning NULL instead.
-- Replace null healthcare access values with 'Unknown'
SELECT
country,
disease_name,
year_event,
COALESCE(healthcare_access::text, 'Unknown') AS healthcare_access
FROM public.global_health_statistics
LIMIT 10;
| country | disease_name | year_event | healthcare_access |
|---|---|---|---|
| Australia | Polio | 2020 | 60.15 |
| Italy | Zika | 2004 | 86.37 |
| Germany | Leprosy | 2021 | 82.38 |
| Nigeria | Dengue | 2005 | 72.18 |
| Germany | Influenza | 2012 | 64.23 |
| Argentina | Rabies | 2014 | 68.28 |
| UK | Zika | 2007 | 70.82 |
| South Korea | Alzheimer’s Disease | 2003 | 78.57 |
| Germany | Zika | 2016 | 97.23 |
| Mexico | Malaria | 2020 | 84.61 |
-- Calculate incidence rate safely (avoid division by zero)
SELECT
country,
disease_name,
incidence / NULLIF(population_affected, 0) AS incidence_rate
FROM public.global_health_statistics
WHERE year_event = 2020
LIMIT 10;
| country | disease_name | incidence_rate |
|---|---|---|
| Italy | Ebola | 9.7e-06 |
| Turkey | COVID-19 | 2.9e-06 |
| Italy | Leprosy | 1.3e-06 |
| South Korea | Leprosy | 4.6e-06 |
| Argentina | Hepatitis | 6.9e-06 |
| Mexico | Measles | 7.3e-06 |
| Mexico | COVID-19 | 8.0e-06 |
| France | Leprosy | 3.9e-06 |
| South Korea | Ebola | 1.2e-06 |
| Turkey | Measles | 3.0e-07 |
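The guard is easier to see on inline data than on a million-row table. In this sketch the second row has a zero denominator; NULLIF converts it to NULL so the division yields NULL instead of an error, and COALESCE supplies a fallback:

```sql
SELECT n, d,
       n / NULLIF(d, 0)              AS safe_ratio,    -- NULL when d = 0
       COALESCE(n / NULLIF(d, 0), 0) AS ratio_or_zero  -- fallback value
FROM (VALUES (10.0, 2.0), (10.0, 0.0)) AS t(n, d);
```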
4. Joins (INNER / LEFT / FULL)
To keep this self-contained, I’ll create two tables to demonstrate common join shapes:
country_region maps countries to regions.
disease_group maps diseases to broader categories.
In a real project these would live in dim.* tables with clear stewardship.
CREATE TEMP TABLE country_region (
country TEXT PRIMARY KEY,
region TEXT NOT NULL
);
INSERT INTO country_region(country, region) VALUES
('Australia','Oceania'),
('United States','North America'),
('Brazil','South America'),
('India','Asia'),
('Kenya','Africa');
CREATE TEMP TABLE disease_group (
disease_name TEXT PRIMARY KEY,
disease_group TEXT NOT NULL
);
INSERT INTO disease_group(disease_name, disease_group) VALUES
('Influenza A','Respiratory'),
('Influenza B','Respiratory'),
('COVID-19','Respiratory'),
('Dengue','Vector-borne'),
('Malaria','Vector-borne');
LEFT JOIN (keep all facts; enrich where possible)
Prefer this when facts must be retained even if the dimension is incomplete.
In the example below, LEFT JOIN ensures all health statistics are retained, even if metadata is missing.
SELECT
g.country,
COALESCE(cr.region, 'Unknown') AS region,
g.disease_name,
g.year_event,
g.population_affected
FROM public.global_health_statistics AS g
LEFT JOIN country_region AS cr
ON cr.country = g.country
WHERE g.year_event BETWEEN 2018 AND 2022
ORDER BY g.year_event DESC, region, g.population_affected DESC
LIMIT 15;
| country | region | disease_name | year_event | population_affected |
|---|---|---|---|---|
| India | Asia | Influenza | 2022 | 998735 |
| India | Asia | COVID-19 | 2022 | 998728 |
| India | Asia | Cholera | 2022 | 998331 |
| India | Asia | Polio | 2022 | 998188 |
| India | Asia | Measles | 2022 | 998057 |
| India | Asia | Cancer | 2022 | 996565 |
| India | Asia | Hypertension | 2022 | 994593 |
| India | Asia | Diabetes | 2022 | 994007 |
| India | Asia | Alzheimer’s Disease | 2022 | 993771 |
| India | Asia | Zika | 2022 | 993585 |
FULL JOIN: keep everything from both sides
Useful for reconciliation and data quality investigations.
SELECT
COALESCE(g.country, cr.country) AS country_key,
cr.region,
g.disease_name,
g.year_event,
g.population_affected,
CASE
WHEN g.country IS NULL THEN 'missing_in_facts'
WHEN cr.country IS NULL THEN 'missing_in_dim'
ELSE 'matched'
END AS join_status
FROM public.global_health_statistics AS g
FULL OUTER JOIN country_region AS cr
ON cr.country = g.country
ORDER BY join_status, country_key
LIMIT 10;
| country_key | region | disease_name | year_event | population_affected | join_status |
|---|---|---|---|---|---|
| Australia | Oceania | Hypertension | 2004 | 995949 | matched |
| Australia | Oceania | Hepatitis | 2009 | 996791 | matched |
| Australia | Oceania | COVID-19 | 2000 | 995669 | matched |
| Australia | Oceania | Measles | 2021 | 977797 | matched |
| Australia | Oceania | Asthma | 2010 | 982446 | matched |
| Australia | Oceania | Malaria | 2012 | 993091 | matched |
| Australia | Oceania | Cholera | 2009 | 993428 | matched |
| Australia | Oceania | Asthma | 2015 | 966733 | matched |
| Australia | Oceania | Asthma | 2013 | 998228 | matched |
| Australia | Oceania | Leprosy | 2011 | 990548 | matched |
Joining multiple dimensions
Add both region and a disease_group, then compute region-level totals.
SELECT
cr.region,
dg.disease_group,
g.year_event,
SUM(g.population_affected) AS population_total
FROM public.global_health_statistics AS g
LEFT JOIN country_region AS cr
ON cr.country = g.country
LEFT JOIN disease_group AS dg
ON dg.disease_name = g.disease_name
WHERE g.year_event >= 2015
GROUP BY cr.region, dg.disease_group, g.year_event
ORDER BY cr.region NULLS LAST, dg.disease_group NULLS LAST, g.year_event;
| region | disease_group | year_event | population_total |
|---|---|---|---|
| Asia | Respiratory | 2015 | 996752 |
| Asia | Respiratory | 2016 | 988454 |
| Asia | Respiratory | 2017 | 999807 |
| Asia | Respiratory | 2018 | 997011 |
| Asia | Respiratory | 2019 | 991913 |
| Asia | Respiratory | 2020 | 997987 |
| Asia | Respiratory | 2021 | 998308 |
| Asia | Respiratory | 2022 | 998728 |
| Asia | Respiratory | 2023 | 980264 |
| Asia | Respiratory | 2024 | 994785 |
Anti-join pattern (find orphans)
Facts with unmapped countries (dimension coverage gap):
SELECT g.*
FROM public.global_health_statistics AS g
WHERE NOT EXISTS (
SELECT 1
FROM country_region AS cr
WHERE cr.country = g.country
)
ORDER BY g.country, g.disease_name, g.year_event
LIMIT 10;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 2000 | Alzheimer’s Disease | Chronic | 11.67 | 2.09 | 0.99 | 19-35 | Male | 999739 | 56.45 | 2.62 | 0.94 | Vaccination | 7431 | Yes | 93.08 | 2852 | 2.11 | 94376 | 0.45 | 89.55 |
| Argentina | 2001 | Alzheimer’s Disease | Genetic | 7.53 | 1.16 | 1.60 | 61+ | Female | 995329 | 55.56 | 2.92 | 9.01 | Therapy | 22389 | No | 89.26 | 2828 | 9.99 | 32326 | 0.71 | 51.91 |
| Argentina | 2002 | Alzheimer’s Disease | Bacterial | 19.84 | 2.36 | 7.98 | 36-60 | Female | 999312 | 80.47 | 1.66 | 7.84 | Surgery | 22607 | Yes | 52.38 | 4166 | 2.33 | 78110 | 0.82 | 71.93 |
| Argentina | 2003 | Alzheimer’s Disease | Respiratory | 2.71 | 1.16 | 6.98 | 61+ | Female | 993597 | 69.29 | 4.80 | 1.09 | Therapy | 22557 | No | 90.07 | 3888 | 8.11 | 92623 | 0.83 | 62.30 |
| Argentina | 2004 | Alzheimer’s Disease | Viral | 3.34 | 14.08 | 2.02 | 0-18 | Male | 991236 | 79.44 | 1.33 | 6.34 | Surgery | 18506 | Yes | 66.13 | 2646 | 5.26 | 70858 | 0.54 | 84.19 |
| Argentina | 2005 | Alzheimer’s Disease | Genetic | 7.95 | 2.05 | 6.92 | 61+ | Male | 994878 | 55.81 | 2.91 | 3.04 | Therapy | 8444 | Yes | 61.27 | 1902 | 2.97 | 70939 | 0.55 | 68.29 |
| Argentina | 2006 | Alzheimer’s Disease | Cardiovascular | 4.59 | 5.16 | 2.99 | 61+ | Other | 999475 | 63.33 | 4.97 | 3.40 | Medication | 14464 | Yes | 98.18 | 156 | 3.38 | 92196 | 0.55 | 33.07 |
| Argentina | 2007 | Alzheimer’s Disease | Neurological | 16.36 | 1.39 | 3.74 | 61+ | Other | 968726 | 90.92 | 1.87 | 2.74 | Medication | 10705 | No | 59.49 | 1924 | 4.10 | 40606 | 0.78 | 69.94 |
| Argentina | 2008 | Alzheimer’s Disease | Neurological | 9.55 | 4.24 | 5.27 | 0-18 | Female | 982297 | 68.95 | 2.43 | 1.69 | Surgery | 35053 | Yes | 82.13 | 3169 | 9.67 | 91423 | 0.71 | 78.84 |
| Argentina | 2009 | Alzheimer’s Disease | Autoimmune | 11.05 | 3.43 | 7.97 | 36-60 | Other | 984953 | 57.13 | 4.31 | 5.23 | Medication | 28014 | Yes | 92.87 | 759 | 2.42 | 95653 | 0.75 | 51.54 |
- Anti-join via LEFT JOIN … IS NULL (equivalent)
SELECT g.*
FROM public.global_health_statistics AS g
LEFT JOIN country_region AS cr
ON cr.country = g.country
WHERE cr.country IS NULL
ORDER BY g.country, g.disease_name, g.year_event
LIMIT 20;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 2000 | Alzheimer’s Disease | Chronic | 11.67 | 2.09 | 0.99 | 19-35 | Male | 999739 | 56.45 | 2.62 | 0.94 | Vaccination | 7431 | Yes | 93.08 | 2852 | 2.11 | 94376 | 0.45 | 89.55 |
| Argentina | 2001 | Alzheimer’s Disease | Genetic | 7.53 | 1.16 | 1.60 | 61+ | Female | 995329 | 55.56 | 2.92 | 9.01 | Therapy | 22389 | No | 89.26 | 2828 | 9.99 | 32326 | 0.71 | 51.91 |
| Argentina | 2002 | Alzheimer’s Disease | Bacterial | 19.84 | 2.36 | 7.98 | 36-60 | Female | 999312 | 80.47 | 1.66 | 7.84 | Surgery | 22607 | Yes | 52.38 | 4166 | 2.33 | 78110 | 0.82 | 71.93 |
| Argentina | 2003 | Alzheimer’s Disease | Respiratory | 2.71 | 1.16 | 6.98 | 61+ | Female | 993597 | 69.29 | 4.80 | 1.09 | Therapy | 22557 | No | 90.07 | 3888 | 8.11 | 92623 | 0.83 | 62.30 |
| Argentina | 2004 | Alzheimer’s Disease | Viral | 3.34 | 14.08 | 2.02 | 0-18 | Male | 991236 | 79.44 | 1.33 | 6.34 | Surgery | 18506 | Yes | 66.13 | 2646 | 5.26 | 70858 | 0.54 | 84.19 |
| Argentina | 2005 | Alzheimer’s Disease | Genetic | 7.95 | 2.05 | 6.92 | 61+ | Male | 994878 | 55.81 | 2.91 | 3.04 | Therapy | 8444 | Yes | 61.27 | 1902 | 2.97 | 70939 | 0.55 | 68.29 |
| Argentina | 2006 | Alzheimer’s Disease | Cardiovascular | 4.59 | 5.16 | 2.99 | 61+ | Other | 999475 | 63.33 | 4.97 | 3.40 | Medication | 14464 | Yes | 98.18 | 156 | 3.38 | 92196 | 0.55 | 33.07 |
| Argentina | 2007 | Alzheimer’s Disease | Neurological | 16.36 | 1.39 | 3.74 | 61+ | Other | 968726 | 90.92 | 1.87 | 2.74 | Medication | 10705 | No | 59.49 | 1924 | 4.10 | 40606 | 0.78 | 69.94 |
| Argentina | 2008 | Alzheimer’s Disease | Neurological | 9.55 | 4.24 | 5.27 | 0-18 | Female | 982297 | 68.95 | 2.43 | 1.69 | Surgery | 35053 | Yes | 82.13 | 3169 | 9.67 | 91423 | 0.71 | 78.84 |
| Argentina | 2009 | Alzheimer’s Disease | Autoimmune | 11.05 | 3.43 | 7.97 | 36-60 | Other | 984953 | 57.13 | 4.31 | 5.23 | Medication | 28014 | Yes | 92.87 | 759 | 2.42 | 95653 | 0.75 | 51.54 |
Semi-Join Pattern (Keep Only Facts That Have a Match)
Goal: return fact rows that have at least one matching dimension row, without returning dimension columns or duplicating facts.
Canonical SQL: WHERE EXISTS (...) (preferred) or IN (subquery).
Unlike inner joins, a semi-join returns each qualifying fact at most once.
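The "at most once" guarantee is what distinguishes a semi-join from an inner join when the dimension key is duplicated. A minimal in-memory sketch (using SQLite via Python's sqlite3 rather than PostgreSQL; the table names and values are made up for illustration):

```python
import sqlite3

# Toy data: one fact row, but the dimension key appears twice
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE facts(country TEXT, cases INT);
INSERT INTO facts VALUES ('India', 10);
CREATE TABLE dim(country TEXT, region TEXT);
INSERT INTO dim VALUES ('India', 'Asia'), ('India', 'Asia-dup');
""")

inner = con.execute(
    "SELECT f.country FROM facts f JOIN dim d ON d.country = f.country"
).fetchall()
semi = con.execute(
    "SELECT f.country FROM facts f "
    "WHERE EXISTS (SELECT 1 FROM dim d WHERE d.country = f.country)"
).fetchall()

print(len(inner))  # 2: the inner join duplicates the fact row
print(len(semi))   # 1: the semi-join returns it at most once
```

The same contrast holds in PostgreSQL: an inner join against a dimension with duplicate keys inflates fact counts, while EXISTS never does.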
- Semi-join via EXISTS ✅ (recommended)
SELECT g.country, g.disease_name, g.year_event, g.population_affected
FROM public.global_health_statistics AS g
WHERE EXISTS (
SELECT 1
FROM country_region AS cr
WHERE cr.country = g.country
)
ORDER BY g.country, g.disease_name, g.year_event
LIMIT 10;
| country | disease_name | year_event | population_affected |
|---|---|---|---|
| Australia | Alzheimer’s Disease | 2000 | 999987 |
| Australia | Alzheimer’s Disease | 2001 | 999380 |
| Australia | Alzheimer’s Disease | 2002 | 989422 |
| Australia | Alzheimer’s Disease | 2003 | 995752 |
| Australia | Alzheimer’s Disease | 2004 | 993018 |
| Australia | Alzheimer’s Disease | 2005 | 994235 |
| Australia | Alzheimer’s Disease | 2006 | 999998 |
| Australia | Alzheimer’s Disease | 2007 | 985789 |
| Australia | Alzheimer’s Disease | 2008 | 994427 |
| Australia | Alzheimer’s Disease | 2009 | 981466 |
3. Appending Data (Stacking Rows)
Appending (row-binding) is done using UNION or UNION ALL.
UNION ALL keeps duplicates.
UNION removes duplicates.
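The duplicate-handling difference is easy to verify on a toy table (a sketch using SQLite via Python's sqlite3; the table and values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x INT)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,)])

# Stack the table on top of itself both ways
union_all = con.execute("SELECT x FROM t UNION ALL SELECT x FROM t").fetchall()
union     = con.execute("SELECT x FROM t UNION SELECT x FROM t").fetchall()

print(len(union_all))  # 6: every row from both sides is kept
print(len(union))      # 2: only the distinct values 1 and 2 survive
```

Because UNION must deduplicate, it is also slower; prefer UNION ALL when the two inputs cannot overlap (as in the year-split example below).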
-- Append two subsets of the same table (stack rows)
SELECT country, year_event, disease_name, population_affected
FROM public.global_health_statistics
WHERE year_event <= 2015
UNION ALL
SELECT country, year_event, disease_name, population_affected
FROM public.global_health_statistics
WHERE year_event > 2015
ORDER BY year_event
LIMIT 20;
| country | year_event | disease_name | population_affected |
|---|---|---|---|
| India | 2000 | Hypertension | 989212 |
| Australia | 2000 | Rabies | 999093 |
| France | 2000 | COVID-19 | 996188 |
| Canada | 2000 | HIV/AIDS | 997838 |
| Italy | 2000 | Tuberculosis | 982734 |
| Russia | 2000 | Ebola | 994461 |
| Brazil | 2000 | Ebola | 998677 |
| Russia | 2000 | Cancer | 994540 |
| Australia | 2000 | HIV/AIDS | 998771 |
| Saudi Arabia | 2000 | Dengue | 981698 |
Part V — Advanced Wrangling
Beyond basic SELECT and JOIN operations, SQL offers powerful tools for advanced wrangling:
- Common Table Expressions (CTEs) for readability and modular pipelines
- Window functions for ranking, partitioning, and moving calculations
- Deduplication using ROW_NUMBER()
- Upserts for managing incremental updates
These techniques are essential for real-world analytics pipelines.
1. Common Table Expressions (CTEs)
A CTE lets us build queries in readable, stepwise blocks. It’s the SQL equivalent of creating intermediate objects in R with the pipe %>%.
In the example below, the first block (mortality_groups) is like an intermediate dataset. The second query reuses it for further filtering.
-- CTE to calculate mortality categories, then filter
WITH mortality_groups AS (
SELECT
country,
disease_name,
year_event,
mortality_pct,
CASE
WHEN mortality_pct > 9 THEN 'High Mortality'
ELSE 'Low/Moderate Mortality'
END AS mortality_group
FROM public.global_health_statistics
)
SELECT *
FROM mortality_groups
WHERE mortality_group = 'High Mortality'
LIMIT 10;
| country | disease_name | year_event | mortality_pct | mortality_group |
|---|---|---|---|---|
| Nigeria | Malaria | 2002 | 9.97 | High Mortality |
| South Korea | Hepatitis | 2024 | 9.48 | High Mortality |
| Argentina | HIV/AIDS | 2007 | 9.65 | High Mortality |
| Brazil | Polio | 2005 | 9.60 | High Mortality |
| Australia | Malaria | 2012 | 9.50 | High Mortality |
| India | Influenza | 2002 | 9.22 | High Mortality |
| Canada | Polio | 2012 | 9.91 | High Mortality |
| Italy | Malaria | 2020 | 9.24 | High Mortality |
| Russia | Malaria | 2008 | 9.72 | High Mortality |
| Nigeria | Asthma | 2013 | 9.15 | High Mortality |
2. Window Functions
Window functions allow calculations across partitions of data without collapsing rows. They are powerful for ranking, moving averages, and comparisons.
Ranking within Groups
-- Rank diseases by affected population per country in 2020
SELECT
country,
disease_name,
population_affected,
RANK() OVER (PARTITION BY country ORDER BY population_affected DESC) AS rank_within_country
FROM public.global_health_statistics
WHERE year_event = 2020
ORDER BY country, rank_within_country
LIMIT 10;
| country | disease_name | population_affected | rank_within_country |
|---|---|---|---|
| Argentina | Cancer | 999522 | 1 |
| Argentina | Dengue | 997702 | 2 |
| Argentina | COVID-19 | 997478 | 3 |
| Argentina | Hypertension | 997396 | 4 |
| Argentina | Alzheimer’s Disease | 995954 | 5 |
| Argentina | Tuberculosis | 995534 | 6 |
| Argentina | Ebola | 995350 | 7 |
| Argentina | HIV/AIDS | 992933 | 8 |
| Argentina | Diabetes | 992740 | 9 |
| Argentina | Zika | 992414 | 10 |
Calculating a rolling 3-row average for selected diseases. Note that the frame is row-based (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW): because this dataset has several rows per disease and year (one per country), the window averages the previous two rows, not the previous two calendar years.
-- 3-year moving average of incidence per disease
SELECT
disease_name,
year_event,
incidence,
ROUND(
(AVG(incidence) OVER (
PARTITION BY disease_name
ORDER BY year_event
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
))::numeric, 2
) AS incidence_moving_avg
FROM public.global_health_statistics
WHERE disease_name IN ('Malaria','Tuberculosis')
ORDER BY disease_name, year_event
LIMIT 15;
| disease_name | year_event | incidence | incidence_moving_avg |
|---|---|---|---|
| Malaria | 2000 | 14.50 | 14.50 |
| Malaria | 2000 | 12.62 | 13.56 |
| Malaria | 2000 | 8.22 | 11.78 |
| Malaria | 2000 | 5.83 | 8.89 |
| Malaria | 2000 | 3.87 | 5.97 |
| Malaria | 2000 | 12.26 | 7.32 |
| Malaria | 2000 | 6.30 | 7.48 |
| Malaria | 2000 | 4.53 | 7.70 |
| Malaria | 2000 | 4.33 | 5.05 |
| Malaria | 2000 | 1.99 | 3.62 |
3. Deduplication with ROW_NUMBER()
Sometimes datasets contain duplicate records. We can use ROW_NUMBER() to keep the “first” occurrence and drop the rest.
Identify duplicates across all columns
The example below will show you which rows are duplicated, and how many times.
For example, if the same country + disease + year + population row appears 3 times, you’ll see duplicate_count = 3.
SELECT
country,
disease_name,
year_event,
population_affected,
COUNT(*) AS duplicate_count
FROM public.global_health_statistics
GROUP BY
country,
disease_name,
year_event,
population_affected
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC
LIMIT 10;
| country | disease_name | year_event | population_affected | duplicate_count |
|---|---|---|---|---|
If you want to remove duplicates, keeping only one record per case
-- This removes duplicates across all columns.
SELECT DISTINCT *
FROM public.global_health_statistics
LIMIT 15;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 2000 | Alzheimer’s Disease | Chronic | 11.67 | 2.09 | 0.99 | 19-35 | Male | 999739 | 56.45 | 2.62 | 0.94 | Vaccination | 7431 | Yes | 93.08 | 2852 | 2.11 | 94376 | 0.45 | 89.55 |
| Argentina | 2000 | Asthma | Parasitic | 6.03 | 11.82 | 4.71 | 36-60 | Male | 999306 | 65.53 | 3.85 | 2.67 | Vaccination | 34713 | Yes | 87.90 | 2810 | 0.42 | 83780 | 0.76 | 22.01 |
| Argentina | 2000 | Cancer | Bacterial | 19.72 | 7.65 | 4.07 | 61+ | Other | 980822 | 57.52 | 3.96 | 5.92 | Medication | 36545 | Yes | 56.26 | 726 | 1.43 | 86501 | 0.61 | 70.75 |
| Argentina | 2000 | Cholera | Infectious | 19.31 | 13.36 | 6.39 | 36-60 | Other | 991759 | 64.20 | 0.87 | 7.23 | Therapy | 5415 | No | 51.64 | 4251 | 7.54 | 95337 | 0.81 | 83.87 |
| Argentina | 2000 | COVID-19 | Genetic | 5.87 | 4.43 | 3.91 | 36-60 | Female | 996726 | 74.90 | 4.33 | 0.79 | Surgery | 48595 | No | 84.81 | 494 | 0.13 | 64551 | 0.41 | 46.80 |
| Argentina | 2000 | Dengue | Chronic | 6.42 | 13.52 | 4.60 | 61+ | Female | 997142 | 69.02 | 1.13 | 4.69 | Therapy | 15127 | Yes | 53.30 | 4906 | 5.08 | 98961 | 0.51 | 40.18 |
| Argentina | 2000 | Diabetes | Viral | 9.74 | 4.52 | 1.08 | 0-18 | Other | 996077 | 60.79 | 3.10 | 2.41 | Medication | 43795 | No | 56.47 | 4572 | 4.81 | 75035 | 0.80 | 68.70 |
| Argentina | 2000 | Ebola | Neurological | 11.57 | 2.21 | 7.15 | 61+ | Female | 996462 | 68.50 | 1.44 | 9.78 | Medication | 33639 | No | 98.32 | 4943 | 8.77 | 21021 | 0.63 | 74.10 |
| Argentina | 2000 | Hepatitis | Viral | 6.85 | 6.60 | 4.23 | 61+ | Other | 975406 | 83.04 | 3.48 | 3.67 | Vaccination | 17061 | No | 73.33 | 1911 | 3.90 | 37765 | 0.71 | 48.95 |
| Argentina | 2000 | HIV/AIDS | Autoimmune | 7.45 | 4.96 | 0.69 | 36-60 | Female | 996694 | 76.82 | 3.45 | 3.63 | Medication | 34932 | Yes | 78.58 | 3455 | 6.66 | 40875 | 0.77 | 89.73 |
If you only want to de-duplicate on some columns (e.g., country, disease_name, year_event), then use DISTINCT ON (Postgres-only):
SELECT DISTINCT ON (country, disease_name, year_event) *
FROM public.global_health_statistics
ORDER BY country, disease_name, year_event, population_affected DESC
LIMIT 15;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 2000 | Alzheimer’s Disease | Chronic | 11.67 | 2.09 | 0.99 | 19-35 | Male | 999739 | 56.45 | 2.62 | 0.94 | Vaccination | 7431 | Yes | 93.08 | 2852 | 2.11 | 94376 | 0.45 | 89.55 |
| Argentina | 2001 | Alzheimer’s Disease | Genetic | 7.53 | 1.16 | 1.60 | 61+ | Female | 995329 | 55.56 | 2.92 | 9.01 | Therapy | 22389 | No | 89.26 | 2828 | 9.99 | 32326 | 0.71 | 51.91 |
| Argentina | 2002 | Alzheimer’s Disease | Bacterial | 19.84 | 2.36 | 7.98 | 36-60 | Female | 999312 | 80.47 | 1.66 | 7.84 | Surgery | 22607 | Yes | 52.38 | 4166 | 2.33 | 78110 | 0.82 | 71.93 |
| Argentina | 2003 | Alzheimer’s Disease | Respiratory | 2.71 | 1.16 | 6.98 | 61+ | Female | 993597 | 69.29 | 4.80 | 1.09 | Therapy | 22557 | No | 90.07 | 3888 | 8.11 | 92623 | 0.83 | 62.30 |
| Argentina | 2004 | Alzheimer’s Disease | Viral | 3.34 | 14.08 | 2.02 | 0-18 | Male | 991236 | 79.44 | 1.33 | 6.34 | Surgery | 18506 | Yes | 66.13 | 2646 | 5.26 | 70858 | 0.54 | 84.19 |
| Argentina | 2005 | Alzheimer’s Disease | Genetic | 7.95 | 2.05 | 6.92 | 61+ | Male | 994878 | 55.81 | 2.91 | 3.04 | Therapy | 8444 | Yes | 61.27 | 1902 | 2.97 | 70939 | 0.55 | 68.29 |
| Argentina | 2006 | Alzheimer’s Disease | Cardiovascular | 4.59 | 5.16 | 2.99 | 61+ | Other | 999475 | 63.33 | 4.97 | 3.40 | Medication | 14464 | Yes | 98.18 | 156 | 3.38 | 92196 | 0.55 | 33.07 |
| Argentina | 2007 | Alzheimer’s Disease | Neurological | 16.36 | 1.39 | 3.74 | 61+ | Other | 968726 | 90.92 | 1.87 | 2.74 | Medication | 10705 | No | 59.49 | 1924 | 4.10 | 40606 | 0.78 | 69.94 |
| Argentina | 2008 | Alzheimer’s Disease | Neurological | 9.55 | 4.24 | 5.27 | 0-18 | Female | 982297 | 68.95 | 2.43 | 1.69 | Surgery | 35053 | Yes | 82.13 | 3169 | 9.67 | 91423 | 0.71 | 78.84 |
| Argentina | 2009 | Alzheimer’s Disease | Autoimmune | 11.05 | 3.43 | 7.97 | 36-60 | Other | 984953 | 57.13 | 4.31 | 5.23 | Medication | 28014 | Yes | 92.87 | 759 | 2.42 | 95653 | 0.75 | 51.54 |
Show the actual duplicate rows (not just counts). This returns the full duplicate rows, each with a duplicate_count column attached.
SELECT *
FROM (
SELECT *,
COUNT(*) OVER (
PARTITION BY country, disease_name, year_event, population_affected
) AS duplicate_count
FROM public.global_health_statistics
) t
WHERE duplicate_count > 1
ORDER BY country, disease_name, year_event
LIMIT 10;
| country | year_event | disease_name | disease_category | prevalence | incidence | mortality_pct | age_group | gender | population_affected | healthcare_access | dr_per_1000 | hospital_beds_per_1000 | treatment_type | average_treatment_cost_usd | availability_of_vaccines_treatment | recovery_rate_pct | dalys | improvement_in_5_years_pct | per_capita_income_usd | education_index | urbanization_rate_pct | duplicate_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-- Keep only the latest record per country/disease
WITH ranked AS (
SELECT
country,
disease_name,
year_event,
population_affected,
ROW_NUMBER() OVER (
PARTITION BY country, disease_name
ORDER BY year_event DESC
) AS rn
FROM public.global_health_statistics
)
SELECT *
FROM ranked
WHERE rn = 1
ORDER BY country, disease_name
LIMIT 20;
| country | disease_name | year_event | population_affected | rn |
|---|---|---|---|---|
| Argentina | Alzheimer’s Disease | 2024 | 994386 | 1 |
| Argentina | Asthma | 2024 | 997195 | 1 |
| Argentina | Cancer | 2024 | 993535 | 1 |
| Argentina | Cholera | 2024 | 999242 | 1 |
| Argentina | COVID-19 | 2024 | 988771 | 1 |
| Argentina | Dengue | 2024 | 988472 | 1 |
| Argentina | Diabetes | 2024 | 987174 | 1 |
| Argentina | Ebola | 2024 | 995657 | 1 |
| Argentina | Hepatitis | 2024 | 990629 | 1 |
| Argentina | HIV/AIDS | 2024 | 989442 | 1 |
4. Upserts Overview
In real-world pipelines, we often receive incremental updates (new years, new countries, or corrected values). An upsert (update + insert) ensures the table stays in sync:
If a record exists → update it
If not → insert it
The ON CONFLICT clause ensures data is not duplicated. Instead, existing rows are updated.
Tip: the target table needs a unique constraint (or primary key) on the conflict columns for ON CONFLICT to apply.
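The examples below use the DO NOTHING branch; the "if a record exists, update it" branch uses ON CONFLICT ... DO UPDATE with the excluded pseudo-table. A minimal sketch (run here against SQLite via Python's sqlite3, which shares this upsert syntax with PostgreSQL; the table and values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE ghs (
    country TEXT, disease_name TEXT, year_event INT,
    population_affected INT,
    UNIQUE (country, disease_name, year_event)
)""")
con.execute("INSERT INTO ghs VALUES ('Australia', 'Influenza', 2022, 100)")

# Second insert hits the unique key, so the existing row is updated instead
con.execute("""
INSERT INTO ghs VALUES ('Australia', 'Influenza', 2022, 250)
ON CONFLICT (country, disease_name, year_event)
DO UPDATE SET population_affected = excluded.population_affected
""")

print(con.execute("SELECT COUNT(*) FROM ghs").fetchone()[0])           # 1
print(con.execute("SELECT population_affected FROM ghs").fetchone()[0])  # 250
```

excluded refers to the row that would have been inserted, so incremental loads can overwrite stale values without ever duplicating the key.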
First, keep only one row per (country, disease_name, year_event), retaining the one with the largest population_affected:
DELETE FROM public.global_health_statistics a
USING (
SELECT ctid, ROW_NUMBER() OVER (
PARTITION BY country, disease_name, year_event
ORDER BY population_affected DESC
) AS rn
FROM public.global_health_statistics
) b
WHERE a.ctid = b.ctid AND b.rn > 1;
-- Creating the unique identifier
ALTER TABLE public.global_health_statistics
DROP CONSTRAINT IF EXISTS ghs_unique;
ALTER TABLE public.global_health_statistics
ADD CONSTRAINT ghs_unique UNIQUE (country, year_event, disease_name);
-- Insert or do nothing if the combination already exists
INSERT INTO public.global_health_statistics (
country, disease_name, year_event
)
VALUES (
'Australia', 'Influenza', 2022
)
ON CONFLICT (country, disease_name, year_event)
DO NOTHING;
Part VI — Data Quality, Types & Dates
🧭 Overview
High-quality analysis starts with high-quality data.
This section focuses on assessing and improving data integrity, handling missing values, managing data types, and performing date/time transformations.
In practice, data rarely arrives in perfect shape — missing records, inconsistent data types, and messy strings are common.
Being able to diagnose and correct these issues directly in SQL demonstrates the ability to build analysis-ready datasets that decision-makers can trust.
Missing data occurs for many reasons: incomplete records, data-entry errors, or missing identifiers.
Understanding, profiling, and deciding how to handle these missing values are essential steps before any statistical or predictive modeling.
The workflow includes:
- Profiling missing data
- Reviewing incomplete records
- Applying imputation (filling) strategies
- Forward and backward filling within patients
- Mean-based imputations
All SQL examples below are written for PostgreSQL but can be adapted easily for SQL Server or SQLite.
1️⃣ Importing and Preparing the Data
To begin, the dataset is imported into PostgreSQL as public.visit_hospital.
This process ensures that the structure matches the source CSV file (C:\temp\visit_hospital.csv).
-- Create table for hospital visit data
DROP TABLE IF EXISTS public.visit_hospital;
CREATE TABLE public.visit_hospital (
-- Identifiers
id TEXT, id_text TEXT,
-- Demographics
birthdate DATE, deathdate DATE, ssn TEXT, drivers TEXT, passport TEXT,
prefix TEXT, first TEXT, middle TEXT, last TEXT, suffix TEXT,
maiden TEXT, marital TEXT, race TEXT, ethnicity TEXT, gender TEXT,
-- Location
birthplace TEXT, address TEXT, city TEXT, state TEXT, county TEXT,
fips INT, zip INT, lat NUMERIC, lon NUMERIC,
-- Healthcare / Socioeconomic
healthcare_expenses NUMERIC, healthcare_coverage NUMERIC, income NUMERIC,
-- Visit metadata
visit INT
);
-- Stage the raw text date: load birthdate into a text column first,
-- then convert it into a proper DATE after the COPY below.
ALTER TABLE public.visit_hospital ADD COLUMN IF NOT EXISTS birthdate_raw TEXT;
-- COPY with:
-- - NULL 'NA' -> turn "NA" into NULL
-- - FORCE_NULL (...) -> also treat empty strings as NULL for listed columns
COPY public.visit_hospital (
id, birthdate_raw, deathdate, ssn, drivers, passport,
prefix, first, middle, last, suffix, maiden,
marital, race, ethnicity, gender,
birthplace, address, city, state, county,
fips, zip, lat, lon,
healthcare_expenses, healthcare_coverage, income,
visit, id_text
)
FROM 'C:/temp/visit_hospital.csv'
WITH (
FORMAT CSV,
HEADER TRUE,
NULL 'NA',
FORCE_NULL (birthdate_raw, deathdate, fips, zip, lat, lon,
healthcare_expenses, healthcare_coverage, income),
DELIMITER ',',
QUOTE '"',
ESCAPE '"',
ENCODING 'UTF8');
-- Clean the staged text into a proper DATE
UPDATE public.visit_hospital
SET birthdate = TO_DATE(birthdate_raw, 'DD/MM/YYYY')
WHERE birthdate_raw ~ '^\d{2}/\d{2}/\d{4}$';
2️⃣ Profiling Missing Data
Understanding which columns have missing values — and how many — is the foundation of any data-quality review.
Before fixing missing data, we must understand how much is missing and where. This first step mimics colSums(is.na()) in R — it counts missing values for each important column.
-- Summarize missing values per selected columns
SELECT
'birthdate' AS column_name, COUNT(*) FILTER (WHERE birthdate IS NULL) AS missing_count FROM public.visit_hospital
UNION ALL SELECT 'deathdate', COUNT(*) FILTER (WHERE deathdate IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'race', COUNT(*) FILTER (WHERE race IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'income', COUNT(*) FILTER (WHERE income IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'passport', COUNT(*) FILTER (WHERE passport IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'fips', COUNT(*) FILTER (WHERE fips IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'zip', COUNT(*) FILTER (WHERE zip IS NULL) FROM public.visit_hospital
UNION ALL SELECT 'healthcare_coverage', COUNT(*) FILTER (WHERE healthcare_coverage IS NULL) FROM public.visit_hospital;
| column_name | missing_count |
|---|---|
| birthdate | 16 |
| deathdate | 40 |
| race | 20 |
| income | 20 |
| passport | 12 |
| fips | 12 |
| zip | 12 |
| healthcare_coverage | 4 |
Context: Each row in this summary represents a variable and the number of NULL (missing) entries. This gives you an at-a-glance understanding of where data quality issues exist, and whether they are systematic (entire columns) or random (sporadic records).
To get the total number of missing cells across the dataset:
-- Calculate total number of missing cells across selected columns
SELECT
SUM(
(birthdate IS NULL)::INT +
(deathdate IS NULL)::INT +
(race IS NULL)::INT +
(income IS NULL)::INT +
(passport IS NULL)::INT +
(fips IS NULL)::INT +
(zip IS NULL)::INT +
(healthcare_coverage IS NULL)::INT
) AS total_missing_cells
FROM public.visit_hospital;
| total_missing_cells |
|---|
| 136 |
This produces one number — the total missing count — useful for data-quality reporting or automated alerts in data pipelines.
3️⃣ Inspecting Rows with Missing Data
After identifying which columns contain missing values, it’s helpful to inspect a few affected records.
-- Display sample of records that contain missing values
SELECT id_text, visit, race, income, passport
FROM public.visit_hospital
WHERE race IS NULL OR income IS NULL OR passport IS NULL
ORDER BY id_text, visit
LIMIT 12;
| id_text | visit | race | income | passport |
|---|---|---|---|---|
| Anit|Sánc|1960-05-07 | 2 | NA | NA | X9157439X |
| Anit|Sánc|1960-05-07 | 3 | white | NA | X9157439X |
| Anit|Sánc|1960-05-07 | 4 | NA | 60016 | X9157439X |
| Aure|Weis|2003-11-12 | 2 | NA | NA | X17586296X |
| Aure|Weis|2003-11-12 | 3 | white | NA | X17586296X |
| Aure|Weis|2003-11-12 | 4 | NA | 7752 | X17586296X |
| Cars|Litt|2012-09-02 | 1 | white | 71169 | NA |
| Cars|Litt|2012-09-02 | 2 | NA | NA | NA |
| Cars|Litt|2012-09-02 | 3 | NA | 70169 | NA |
| Cars|Litt|2012-09-02 | 4 | white | NA | NA |
This query mimics filter(!complete.cases(.)) in R. It reveals the context of missing values — are they clustered by person, visit, or variable? This step helps determine the appropriate imputation method (e.g., individual-level fill vs. global averages).
4️⃣ Handling Missing Values (Imputation)
Imputation means replacing missing data with estimated or default values. This section demonstrates several techniques — from simple replacements to forward/backward fills.
4.1 Simple categorical imputation: Replacement with COALESCE()
-- Replace missing passport values with 'UNKNOWN'
SELECT
id_text,
visit,
passport,
COALESCE(passport, 'UNKNOWN') AS passport_imputed
FROM public.visit_hospital
WHERE passport IS NULL
LIMIT 12;
| id_text | visit | passport | passport_imputed |
|---|---|---|---|
| Cars|Litt|2012-09-02 | 1 | NA | UNKNOWN |
| Doll|Fran|2008-10-04 | 1 | NA | UNKNOWN |
| Cars|Litt|2012-09-02 | 2 | NA | UNKNOWN |
| Doll|Fran|2008-10-04 | 2 | NA | UNKNOWN |
| Cars|Litt|2012-09-02 | 3 | NA | UNKNOWN |
| Doll|Fran|2008-10-04 | 3 | NA | UNKNOWN |
| Cars|Litt|2012-09-02 | 4 | NA | UNKNOWN |
| Doll|Fran|2008-10-04 | 4 | NA | UNKNOWN |
| Jimm|Doyl|2005-04-25 | 1 | NA | UNKNOWN |
| Jimm|Doyl|2005-04-25 | 2 | NA | UNKNOWN |
COALESCE() replaces NULL with a specified value. For example, unknown passport numbers might represent unregistered patients — instead of leaving them blank, we explicitly mark them as “UNKNOWN”. This improves downstream clarity and avoids confusion between “missing” and “not applicable.”
4.2 Numeric imputations
Global mean: stable when per-ID history is sparse. Participant mean: more tailored when individuals have multiple visits.
4.2.1 Global Mean Imputation for Numeric Data
-- Compute overall average income (ignoring missing values)
SELECT AVG(income) AS global_mean_income
FROM public.visit_hospital
WHERE income IS NOT NULL;
| global_mean_income |
|---|
| 96123.1 |
-- Replace missing income values with global mean
SELECT
id_text,
visit,
income,
COALESCE(income, (SELECT AVG(income) FROM public.visit_hospital WHERE income IS NOT NULL)) AS income_imputed_global
FROM public.visit_hospital
ORDER BY id_text, visit
LIMIT 12;
| id_text | visit | income | income_imputed_global |
|---|---|---|---|
| Anit|Sánc|1960-05-07 | 1 | 61016 | 61016.0 |
| Anit|Sánc|1960-05-07 | 2 | NA | 96123.1 |
| Anit|Sánc|1960-05-07 | 3 | NA | 96123.1 |
| Anit|Sánc|1960-05-07 | 4 | 60016 | 60016.0 |
| Aure|Weis|2003-11-12 | 1 | 8752 | 8752.0 |
| Aure|Weis|2003-11-12 | 2 | NA | 96123.1 |
| Aure|Weis|2003-11-12 | 3 | NA | 96123.1 |
| Aure|Weis|2003-11-12 | 4 | 7752 | 7752.0 |
| Cars|Litt|2012-09-02 | 1 | 71169 | 71169.0 |
| Cars|Litt|2012-09-02 | 2 | NA | 96123.1 |
This replaces missing income values with the overall mean. While not ideal for predictive models (since it reduces variance), it provides a consistent baseline for descriptive summaries and visualization readiness.
4.2.2 Group-Level (Participant) Mean Imputation
A more refined approach is to use each patient’s own average income (if available). This reduces bias when patients vary substantially in income levels.
To better respect within-person variation, we calculate mean income for each participant (id_text) and use that as a fill value.
-- Compute mean income per patient
WITH participant_mean AS (
SELECT id_text, AVG(income) AS mean_income
FROM public.visit_hospital
WHERE income IS NOT NULL
GROUP BY id_text
)
SELECT
v.id_text,
v.visit,
v.income,
COALESCE(v.income, p.mean_income) AS income_imputed_participant
FROM public.visit_hospital v
LEFT JOIN participant_mean p
ON v.id_text = p.id_text
ORDER BY v.id_text, v.visit
LIMIT 12;
| id_text | visit | income | income_imputed_participant |
|---|---|---|---|
| Anit|Sánc|1960-05-07 | 1 | 61016 | 61016 |
| Anit|Sánc|1960-05-07 | 2 | NA | 60516 |
| Anit|Sánc|1960-05-07 | 3 | NA | 60516 |
| Anit|Sánc|1960-05-07 | 4 | 60016 | 60016 |
| Aure|Weis|2003-11-12 | 1 | 8752 | 8752 |
| Aure|Weis|2003-11-12 | 2 | NA | 8252 |
| Aure|Weis|2003-11-12 | 3 | NA | 8252 |
| Aure|Weis|2003-11-12 | 4 | 7752 | 7752 |
| Cars|Litt|2012-09-02 | 1 | 71169 | 71169 |
| Cars|Litt|2012-09-02 | 2 | NA | 70669 |
Context: This technique aligns with R’s group_by(id_text) %>% summarise(mean(income, na.rm = TRUE)). It produces personalized imputations and avoids using global means that may misrepresent individual economic patterns.
4.3 Forward and Backward Fill (Sequential Imputation)
When data is longitudinal (multiple visits per patient), we may want to carry values forward or backward. SQL has no native fill() like R’s tidyr, but window functions over visit-ordered rows achieve the same effect.
Forward and backward fills propagate the most recent known value across sequential observations — useful when attributes like race or gender should remain constant over multiple visits.
Forward filling uses the last known race value, while backward filling retrieves the next known one. These techniques are valuable when longitudinal patient data should remain internally consistent.
Why both? In visit data, the first or last known race may be the only non-missing value for a participant. Having forward and backward fills lets us choose the nearest known value relative to each visit.
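The running-group trick used in the fills below (a cumulative count of non-NULL values defines "islands" of rows that share the same fill value) can be sketched in plain Python:

```python
# Sketch of the SQL running-group forward fill in plain Python.
# grp mimics SUM(CASE WHEN v IS NOT NULL THEN 1 ELSE 0 END) OVER (ORDER BY ...):
# each non-None value opens a new island, and every row in that island
# gets the island's opening value (MAX(race) OVER (PARTITION BY grp) in SQL).
def forward_fill(values):
    grp, count = [], 0
    for v in values:
        if v is not None:
            count += 1
        grp.append(count)
    openers = {}  # island id -> the non-None value that opened it
    for g, v in zip(grp, values):
        if v is not None:
            openers[g] = v
    return [openers.get(g) for g in grp]

print(forward_fill(['white', None, 'white', None]))
# ['white', 'white', 'white', 'white']
print(forward_fill([None, 'asian', None]))
# [None, 'asian', 'asian']  (stays None until the first known value)
```

Reversing the sort order before applying the same logic yields the backward fill, exactly as the ORDER BY visit DESC variant does in SQL.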
Forward Fill (by id_text, ordered by visit)
-- Forward fill race within each id_text
WITH base AS (
SELECT
id_text,
visit,
race,
-- running group: increases when we hit a non-NULL race
SUM(CASE WHEN race IS NOT NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY id_text ORDER BY visit
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
FROM public.visit_hospital
)
SELECT
id_text,
visit,
race,
-- carry last known non-NULL forward; stays NULL until first non-NULL appears
MAX(race) OVER (PARTITION BY id_text, grp) AS race_forward
FROM base
ORDER BY id_text, visit
LIMIT 12;
| id_text | visit | race | race_forward |
|---|---|---|---|
| Anit|Sánc|1960-05-07 | 1 | white | white |
| Anit|Sánc|1960-05-07 | 2 | NA | white |
| Anit|Sánc|1960-05-07 | 3 | white | white |
| Anit|Sánc|1960-05-07 | 4 | NA | white |
| Aure|Weis|2003-11-12 | 1 | white | white |
| Aure|Weis|2003-11-12 | 2 | NA | white |
| Aure|Weis|2003-11-12 | 3 | white | white |
| Aure|Weis|2003-11-12 | 4 | NA | white |
| Cars|Litt|2012-09-02 | 1 | white | white |
| Cars|Litt|2012-09-02 | 2 | NA | white |
Backward Fill
-- Backward fill missing race values by patient
WITH base AS (
SELECT
id_text,
visit,
race,
-- running group from the right: increases when we hit a non-NULL race scanning backwards
SUM(CASE WHEN race IS NOT NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY id_text ORDER BY visit DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp_rev
FROM public.visit_hospital
)
SELECT
id_text,
visit,
race,
MAX(race) OVER (PARTITION BY id_text, grp_rev) AS race_backward
FROM base
ORDER BY id_text, visit
LIMIT 12;
| id_text | visit | race | race_backward |
|---|---|---|---|
| Anit|Sánc|1960-05-07 | 1 | white | white |
| Anit|Sánc|1960-05-07 | 2 | NA | white |
| Anit|Sánc|1960-05-07 | 3 | white | white |
| Anit|Sánc|1960-05-07 | 4 | NA | NA |
| Aure|Weis|2003-11-12 | 1 | white | white |
| Aure|Weis|2003-11-12 | 2 | NA | white |
| Aure|Weis|2003-11-12 | 3 | white | white |
| Aure|Weis|2003-11-12 | 4 | NA | NA |
| Cars|Litt|2012-09-02 | 1 | white | white |
| Cars|Litt|2012-09-02 | 2 | NA | white |
4.4 Combined imputation review (side-by-side)
The example below shows a practical “imputation report”: original vs. forward/backward race and both income fills, restricted to rows that actually had gaps. This gives reviewers a one-glance sanity check.
WITH base AS (
SELECT
id_text, visit, race, income
FROM public.visit_hospital
),
-- forward-fill race within id_text
ff_prep AS (
SELECT *,
SUM((race IS NOT NULL)::int)
OVER (PARTITION BY id_text ORDER BY visit) AS grp
FROM base
),
fwd AS (
SELECT id_text, visit,
MAX(race) OVER (PARTITION BY id_text, grp) AS race_forward
FROM ff_prep
),
-- backward-fill race within id_text (for leading NULLs)
bf_prep AS (
SELECT *,
SUM((race IS NOT NULL)::int)
OVER (PARTITION BY id_text ORDER BY visit DESC) AS grp_rev
FROM base
),
bwd AS (
SELECT id_text, visit,
MAX(race) OVER (PARTITION BY id_text, grp_rev) AS race_backward
FROM bf_prep
),
-- means for income imputation
gmean AS (
SELECT AVG(income) AS global_mean_income
FROM public.visit_hospital
WHERE income IS NOT NULL
),
pmean AS (
SELECT id_text, AVG(income) AS mean_income_id
FROM public.visit_hospital
WHERE income IS NOT NULL
GROUP BY id_text
)
SELECT
v.id_text, v.visit,
v.race,
f.race_forward,
b.race_backward,
v.income,
COALESCE(v.income, g.global_mean_income) AS income_global_mean,
COALESCE(v.income, p.mean_income_id) AS income_participant_mean
FROM public.visit_hospital v
LEFT JOIN fwd f USING (id_text, visit)
LEFT JOIN bwd b USING (id_text, visit)
CROSS JOIN gmean g
LEFT JOIN pmean p USING (id_text)
WHERE v.race IS NULL OR v.income IS NULL
ORDER BY v.id_text, v.visit NULLS FIRST
LIMIT 15;
| id_text | visit | race | race_forward | race_backward | income | income_global_mean | income_participant_mean |
|---|---|---|---|---|---|---|---|
| Anit|Sánc|1960-05-07 | 2 | NA | white | white | NA | 96123.1 | 60516 |
| Anit|Sánc|1960-05-07 | 3 | white | white | white | NA | 96123.1 | 60516 |
| Anit|Sánc|1960-05-07 | 4 | NA | white | NA | 60016 | 60016.0 | 60016 |
| Aure|Weis|2003-11-12 | 2 | NA | white | white | NA | 96123.1 | 8252 |
| Aure|Weis|2003-11-12 | 3 | white | white | white | NA | 96123.1 | 8252 |
| Aure|Weis|2003-11-12 | 4 | NA | white | NA | 7752 | 7752.0 | 7752 |
| Cars|Litt|2012-09-02 | 2 | NA | white | white | NA | 96123.1 | 70669 |
| Cars|Litt|2012-09-02 | 3 | NA | white | white | 70169 | 70169.0 | 70169 |
| Cars|Litt|2012-09-02 | 4 | white | white | white | NA | 96123.1 | 70669 |
| Doll|Fran|2008-10-04 | 2 | NA | white | white | NA | 96123.1 | 187523 |
🧮 Summary Table of Missing Data Techniques
| Task / Objective | SQL Feature or Function | Purpose / What It Does | When to Use It |
|---|---|---|---|
| Profile Missing Data | `COUNT(*) FILTER (WHERE col IS NULL)` | Counts how many missing (NULL) entries exist in each column | Always the first step in a data-quality audit |
| Check Total Missing Cells | `(col IS NULL)::int` summed across columns | Produces a total missing count across all variables | Useful for an overall data-completeness score |
| Inspect Incomplete Records | `WHERE col IS NULL OR ...` | Shows which specific rows have missing values | Use to detect patterns in missingness (e.g., per patient or visit) |
| Replace Missing Text Values | `COALESCE(column, 'UNKNOWN')` | Fills missing categorical data with an explicit label | Good for identifiers or attributes like passport, gender, or ethnicity |
| Replace Missing Numeric Values (Global Mean) | `COALESCE(income, (SELECT AVG(income) FROM ...))` | Fills missing numeric fields with a global average | Simple and consistent for reporting, though it may reduce data variance |
| Replace Missing Numeric Values (Group Mean) | `LEFT JOIN` on grouped `AVG()` | Calculates and fills missing values with participant-specific means | More realistic for longitudinal or panel data (e.g., income per patient) |
| Forward Fill (Downward) | Running non-NULL counter `SUM(...) OVER (PARTITION BY id ORDER BY time)` plus `MAX(col) OVER (PARTITION BY id, grp)` | Propagates the last known non-NULL value forward | Use when variables remain stable over time (e.g., race, gender); note PostgreSQL lacks `IGNORE NULLS`, so `LAST_VALUE` alone will not fill |
| Backward Fill (Upward) | The same counter computed with `ORDER BY time DESC` | Fills missing values upward (from the next known observation) | Complements forward fill for complete coverage |
| Type Casting for Numeric Fields | `CAST(col AS NUMERIC)` | Converts text-encoded numbers to a true numeric type | Needed before aggregation, averages, or mathematical operations |
| String Cleaning (Regex) | `REGEXP_REPLACE(text, '[^A-Za-z\\s]', '', 'g')` | Removes non-letter characters or formatting noise | Prepares identifiers or names for joins and standardization |
| Date Conversion | `TO_DATE(col, 'YYYY-MM-DD')` | Converts character dates to date objects | Enables date arithmetic, filtering, and time-based grouping |
| Date Component Extraction | `EXTRACT(YEAR FROM date_col)` | Retrieves year, month, or day components | Supports monthly/annual trend analysis |
| Filtering Complete Cases | `WHERE col1 IS NOT NULL AND col2 IS NOT NULL` | Keeps only fully complete rows for analysis | Appropriate for strict models or visualizations that can’t handle nulls |
Analytical Insights
- Profiling comes before cleaning. Always understand the pattern and reason behind missingness before applying fixes.
- Context drives imputation choice. Not every missing value should be replaced — sometimes absence itself carries information (e.g., missing death date = still alive).
- Group-based imputations preserve individual structure. Replacing income within each participant group respects the natural data hierarchy.
- Sequential fills replicate temporal logic. Forward and backward filling are essential for visit-based or longitudinal data where attributes should persist across time.
- Documentation ensures transparency. Each step in the SQL workflow can be reviewed, audited, or reproduced by other analysts.
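The second insight above can be made concrete: a missing deathdate may itself carry information rather than being a gap to impute. A minimal sketch against the same visit_hospital table (the vital_status label is illustrative, not an existing column):

```sql
-- Treat a NULL deathdate as information (patient presumed alive), not as a value to fill
SELECT
  id_text,
  deathdate,
  CASE
    WHEN deathdate IS NULL THEN 'presumed alive'
    ELSE 'deceased'
  END AS vital_status
FROM public.visit_hospital
LIMIT 10;
```

Encoding the meaning of absence explicitly like this keeps downstream reports honest about what “missing” represents.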
Data Casting & Cleaning (Regex, Numeric/Text Conversion)
Real-world datasets often contain inconsistent or improperly formatted data — text in numeric fields, extra spaces, mixed casing, or special characters in names. These issues arise during data collection, system exports, or file conversions, and they can silently break joins, aggregations, or numeric calculations.
This section focuses on casting, cleaning, and standardizing key variables in the visit_hospital dataset.
We’ll work through examples that show how SQL handles type conversion, regex-based cleanup, and normalization of text fields.
1️⃣ Understanding Data Type Inconsistencies
Before fixing, it’s important to detect type or formatting inconsistencies.
For example, numeric columns may contain text artifacts ('N/A', 'unknown', '#NULL!'), and text columns may contain hidden characters or mixed case.
-- Preview some raw data for inspection
SELECT id_text, income, fips, zip, passport
FROM public.visit_hospital
LIMIT 12;
| id_text | income | fips | zip | passport |
|---|---|---|---|---|
| Hatt|Nade|1986-07-04 | 86384 | 25017 | 2468 | X61479569X |
| Cars|Litt|2012-09-02 | 71169 | 44007 | 2861 | NA |
| Anit|Sánc|1960-05-07 | 61016 | 25009 | 1915 | X9157439X |
| Doll|Fran|2008-10-04 | 188023 | NA | NA | NA |
| Hatt|Nade|1986-07-04 | NA | 25017 | 2468 | X61479569X |
| Cars|Litt|2012-09-02 | NA | 44007 | 2861 | NA |
| Anit|Sánc|1960-05-07 | NA | 25009 | 1915 | X9157439X |
| Doll|Fran|2008-10-04 | NA | NA | NA | NA |
| Hatt|Nade|1986-07-04 | 85384 | 25017 | 2468 | X61479569X |
| Cars|Litt|2012-09-02 | 70169 | 44007 | 2861 | NA |
This step gives a sense of what we’re dealing with. For example, check whether the numeric income column contains null-like strings, or whether fips and zip contain unexpected symbols or missing digits. Knowing what’s wrong guides the cleaning strategy.
-- Add raw text copies of selected columns for the cleaning demonstrations
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS income_raw TEXT,
ADD COLUMN IF NOT EXISTS zip_raw TEXT,
ADD COLUMN IF NOT EXISTS fips_raw TEXT,
ADD COLUMN IF NOT EXISTS marital_raw TEXT,
ADD COLUMN IF NOT EXISTS city_raw TEXT,
ADD COLUMN IF NOT EXISTS passport_raw TEXT;

UPDATE public.visit_hospital
SET income_raw = income::text,
    zip_raw = zip::text,
    fips_raw = fips::text,
    marital_raw = marital,
    city_raw = city,
    passport_raw = passport;

-- Generate a row number to use with the modulo operator (%)
-- and inject messy demo values into the raw columns
WITH numbered AS (
  SELECT id, row_number() OVER (ORDER BY id) AS rn
  FROM public.visit_hospital
)
UPDATE public.visit_hospital AS v
SET income_raw = CASE WHEN n.rn % 10 = 0 THEN 'N/A'
                      WHEN n.rn % 15 = 0 THEN '#NULL!'
                      ELSE v.income_raw END,
    zip_raw = CASE WHEN n.rn % 12 = 0 THEN '123'
                   WHEN n.rn % 13 = 0 THEN '12-345'
                   WHEN n.rn % 14 = 0 THEN 'A12B3'
                   ELSE v.zip_raw END,
    fips_raw = CASE WHEN n.rn % 11 = 0 THEN 'abc12'
                    WHEN n.rn % 17 = 0 THEN '1234'
                    ELSE v.fips_raw END,
    marital_raw = CASE WHEN n.rn % 8 = 0 THEN 'Single'
                       WHEN n.rn % 9 = 0 THEN 'single'
                       WHEN n.rn % 10 = 0 THEN 'SINGLE'
                       ELSE v.marital_raw END,
    city_raw = CASE WHEN n.rn % 20 = 0 THEN ' new york '
                    WHEN n.rn % 21 = 0 THEN 'CH!CAGO'
                    WHEN n.rn % 22 = 0 THEN 'los_angeles'
                    ELSE v.city_raw END,
    passport_raw = CASE WHEN n.rn % 25 = 0 THEN ''
                        WHEN n.rn % 26 = 0 THEN NULL
                        ELSE v.passport_raw END
FROM numbered AS n
WHERE v.id = n.id;
2️⃣ Casting Numeric-Like Columns
If a column expected to be numeric contains non-numeric text, direct aggregations like SUM() or AVG() will fail. We can safely convert these columns by filtering or replacing invalid entries.
2.1 Checking Invalid Numeric Values
-- Identify non-numeric entries in columns expected to be numeric
SELECT id_text, income
FROM public.visit_hospital
WHERE income::text !~ '^[0-9]+(\.[0-9]+)?$'
AND income IS NOT NULL
LIMIT 10;
| id_text | income |
|---|---|
The regex '^[0-9]+(\.[0-9]+)?$' matches only valid integers or decimals; anything else (e.g., 'N/A', '#NULL!', 'unknown') is non-numeric. The !~ operator in PostgreSQL returns rows whose values do not match the regex. Here the check returns no rows because income is already stored as a numeric column; applied to a text column such as income_raw, it is where invalid entries would surface.
2.2 Convert to Numeric Safely
-- Cast to numeric only when the text form matches a numeric pattern
SELECT
id_text,
income,
CASE
WHEN income::text ~ '^\s*-?\d+(\.\d+)?\s*$' THEN income::numeric
ELSE NULL
END AS income_cleaned
FROM public.visit_hospital
LIMIT 12;
| id_text | income | income_cleaned |
|---|---|---|
| Hatt|Nade|1986-07-04 | 86384 | 86384 |
| Cars|Litt|2012-09-02 | 71169 | 71169 |
| Anit|Sánc|1960-05-07 | 61016 | 61016 |
| Doll|Fran|2008-10-04 | 188023 | 188023 |
| Hatt|Nade|1986-07-04 | NA | NA |
| Cars|Litt|2012-09-02 | NA | NA |
| Anit|Sánc|1960-05-07 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | 85384 | 85384 |
| Cars|Litt|2012-09-02 | 70169 | 70169 |
This avoids errors by casting only valid numeric entries. Invalid strings are replaced with NULL, allowing numeric aggregations to run safely.
3️⃣ Text Cleaning with Regex and Standardization
3.1 Clean City and Name Fields
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS city_raw TEXT;

UPDATE public.visit_hospital
SET city_raw = city;

-- Remove special characters and extra spaces from text
SELECT
id_text,
city_raw,
REGEXP_REPLACE(TRIM(city_raw), '[^A-Za-z\\s]', '', 'g') AS city_cleaned
FROM public.visit_hospital
LIMIT 12;
| id_text | city_raw | city_cleaned |
|---|---|---|
| Hatt|Nade|1986-07-04 | Newton | Newton |
| Cars|Litt|2012-09-02 | Attleboro | Attleboro |
| Anit|Sánc|1960-05-07 | Beverly | Beverly |
| Doll|Fran|2008-10-04 | Shrewsbury | Shrewsbury |
| Hatt|Nade|1986-07-04 | Newton | Newton |
| Cars|Litt|2012-09-02 | Attleboro | Attleboro |
| Anit|Sánc|1960-05-07 | Beverly | Beverly |
| Doll|Fran|2008-10-04 | Shrewsbury | Shrewsbury |
| Hatt|Nade|1986-07-04 | Newton | Newton |
| Cars|Litt|2012-09-02 | Attleboro | Attleboro |
TRIM() removes leading and trailing spaces.
REGEXP_REPLACE() strips non-alphabetic characters (e.g., “CH!CAGO” → “CHCAGO”).
These techniques clean text for better grouping and joining.
3.2 Standardize Text Casing
-- Normalize categorical text casing
SELECT
id_text,
marital,
LOWER(marital) AS marital_lower,
INITCAP(LOWER(marital)) AS marital_title
FROM public.visit_hospital
LIMIT 12;
| id_text | marital | marital_lower | marital_title |
|---|---|---|---|
| Hatt|Nade|1986-07-04 | M | m | M |
| Cars|Litt|2012-09-02 | NA | NA | NA |
| Anit|Sánc|1960-05-07 | W | w | W |
| Doll|Fran|2008-10-04 | NA | NA | NA |
| Hatt|Nade|1986-07-04 | M | m | M |
| Cars|Litt|2012-09-02 | NA | NA | NA |
| Anit|Sánc|1960-05-07 | W | w | W |
| Doll|Fran|2008-10-04 | NA | NA | NA |
| Hatt|Nade|1986-07-04 | M | m | M |
| Cars|Litt|2012-09-02 | NA | NA | NA |
Mixed casing can create false duplicates (“Single”, “single”, “SINGLE”). LOWER() ensures uniformity for analysis, while INITCAP() is ideal for presentation-ready formats.
4️⃣ Cleaning and Padding Identifiers
4.1 ZIP Code Cleaning
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS zip_raw TEXT;

UPDATE public.visit_hospital
SET zip_raw = zip::text
WHERE zip_raw IS NULL;

-- Remove non-numeric characters and pad to 5 digits
SELECT
id_text,
zip_raw,
LPAD(REGEXP_REPLACE(zip_raw, '[^0-9]', '', 'g'), 5, '0') AS zip_cleaned
FROM public.visit_hospital
LIMIT 12;
| id_text | zip_raw | zip_cleaned |
|---|---|---|
| Hatt|Nade|1986-07-04 | 2468 | 02468 |
| Cars|Litt|2012-09-02 | 2861 | 02861 |
| Anit|Sánc|1960-05-07 | 1915 | 01915 |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | 2468 | 02468 |
| Cars|Litt|2012-09-02 | 2861 | 02861 |
| Cars|Litt|2012-09-02 | 2861 | 02861 |
| Anit|Sánc|1960-05-07 | 1915 | 01915 |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | 2468 | 02468 |
REGEXP_REPLACE() removes dashes, letters, and spaces; LPAD() ensures all ZIP codes have five digits (‘123’ → ‘00123’).
4.2 FIPS Code Cleaning
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS fips_raw TEXT;

UPDATE public.visit_hospital
SET fips_raw = fips::text
WHERE fips_raw IS NULL;

-- Clean and normalize FIPS codes
SELECT
id_text,
fips_raw,
LPAD(REGEXP_REPLACE(fips_raw, '[^0-9]', '', 'g'), 5, '0') AS fips_cleaned
FROM public.visit_hospital
LIMIT 12;
| id_text | fips_raw | fips_cleaned |
|---|---|---|
| Hatt|Nade|1986-07-04 | 25017 | 25017 |
| Cars|Litt|2012-09-02 | 44007 | 44007 |
| Anit|Sánc|1960-05-07 | 25009 | 25009 |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | 25017 | 25017 |
| Cars|Litt|2012-09-02 | 44007 | 44007 |
| Cars|Litt|2012-09-02 | 44007 | 44007 |
| Anit|Sánc|1960-05-07 | 25009 | 25009 |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | 25017 | 25017 |
Standardized FIPS codes ensure geographic consistency. Removing non-numeric characters prevents join failures when linking with regional or census datasets.
5️⃣ Categorical & Identifier Normalization
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS marital_raw TEXT;

UPDATE public.visit_hospital
SET marital_raw = marital
WHERE marital_raw IS NULL;

-- Identify distinct marital variants
SELECT DISTINCT marital_raw
FROM public.visit_hospital
ORDER BY marital_raw;
| marital_raw |
|---|
| D |
| M |
| S |
| W |
| NA |
In manually entered survey or registry data, this kind of check quickly reveals casing variants such as “single”, “Single”, and “SINGLE” that would otherwise split one category into several.
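To see how many raw spellings would collapse into each normalized category, the same check can group on a lowercased, trimmed value (a sketch, assuming marital_raw is populated as above):

```sql
-- Count raw spelling variants per normalized marital category
SELECT
  LOWER(TRIM(marital_raw))    AS marital_norm,
  COUNT(DISTINCT marital_raw) AS raw_variants,
  COUNT(*)                    AS n_rows
FROM public.visit_hospital
GROUP BY LOWER(TRIM(marital_raw))
ORDER BY marital_norm;
```

A raw_variants count above 1 flags categories that need case normalization before aggregation.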
6️⃣ Before vs After: Data Cleaning Summary View
The next query displays raw vs. cleaned columns side-by-side, allowing a clear visual comparison for documentation or dashboards.
-- Add raw text versions of numeric/text fields for cleaning
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS income_raw TEXT,
ADD COLUMN IF NOT EXISTS zip_raw TEXT,
ADD COLUMN IF NOT EXISTS fips_raw TEXT,
ADD COLUMN IF NOT EXISTS marital_raw TEXT,
ADD COLUMN IF NOT EXISTS city_raw TEXT,
ADD COLUMN IF NOT EXISTS passport_raw TEXT;
-- Populate them from existing columns
UPDATE public.visit_hospital
SET
income_raw = income::text,
zip_raw = zip::text,
fips_raw = fips::text,
marital_raw = marital,
city_raw = city,
passport_raw = passport
WHERE income_raw IS NULL
OR zip_raw IS NULL
OR fips_raw IS NULL
OR marital_raw IS NULL
OR city_raw IS NULL
OR passport_raw IS NULL;

-- Create a clean comparison view of selected variables
SELECT
id_text,
income_raw,
CASE
WHEN income_raw ~ '^\s*-?\d+(\.\d+)?\s*$'
THEN CAST(TRIM(income_raw) AS NUMERIC)
ELSE NULL
END AS income_cleaned,
zip_raw,
LPAD(REGEXP_REPLACE(zip_raw, '[^0-9]', '', 'g'), 5, '0') AS zip_cleaned,
fips_raw,
LPAD(REGEXP_REPLACE(fips_raw, '[^0-9]', '', 'g'), 5, '0') AS fips_cleaned,
marital_raw,
INITCAP(LOWER(marital_raw)) AS marital_cleaned,
city_raw,
REGEXP_REPLACE(TRIM(city_raw), '[^A-Za-z\\s]', '', 'g') AS city_cleaned,
passport_raw,
COALESCE(NULLIF(TRIM(passport_raw), ''), 'UNKNOWN') AS passport_cleaned
FROM public.visit_hospital
ORDER BY id_text
LIMIT 20;
| id_text | income_raw | income_cleaned | zip_raw | zip_cleaned | fips_raw | fips_cleaned | marital_raw | marital_cleaned | city_raw | city_cleaned | passport_raw | passport_cleaned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anit|Sánc|1960-05-07 | NA | NA | 1915 | 01915 | 25009 | 25009 | W | W | Beverly | Beverly | X9157439X | X9157439X |
| Anit|Sánc|1960-05-07 | 61016 | 61016 | 1915 | 01915 | 25009 | 25009 | W | W | Beverly | Beverly | X9157439X | X9157439X |
| Anit|Sánc|1960-05-07 | 60016 | 60016 | 1915 | 01915 | 25009 | 25009 | W | W | Beverly | Beverly | X9157439X | X9157439X |
| Anit|Sánc|1960-05-07 | NA | NA | 1915 | 01915 | 25009 | 25009 | W | W | Beverly | Beverly | X9157439X | X9157439X |
| Aure|Weis|2003-11-12 | NA | NA | 1606 | 01606 | 25027 | 25027 | NA | NA | Worcester | Worcester | X17586296X | X17586296X |
| Aure|Weis|2003-11-12 | 7752 | 7752 | 1606 | 01606 | 25027 | 25027 | NA | NA | Worcester | Worcester | X17586296X | X17586296X |
| Aure|Weis|2003-11-12 | 8752 | 8752 | 1606 | 01606 | 25027 | 25027 | NA | NA | Worcester | Worcester | X17586296X | X17586296X |
| Aure|Weis|2003-11-12 | NA | NA | 1606 | 01606 | 25027 | 25027 | NA | NA | Worcester | Worcester | X17586296X | X17586296X |
| Cars|Litt|2012-09-02 | 71169 | 71169 | 2861 | 02861 | 44007 | 44007 | NA | NA | Attleboro | Attleboro | NA | UNKNOWN |
| Cars|Litt|2012-09-02 | NA | NA | 2861 | 02861 | 44007 | 44007 | NA | NA | Attleboro | Attleboro | NA | UNKNOWN |
This “before and after” display is ideal for portfolio demonstrations — clearly showing how SQL transformations standardize messy data.
🧮 Summary of Techniques and Insights
| Task / Goal | SQL Technique | Purpose / What It Fixes | Analytical Context / When to Use |
|---|---|---|---|
| Detect invalid numeric entries | `!~ '^\s*-?\d+(\.\d+)?\s*$'` | Finds non-numeric strings in numeric columns | Ensures numeric operations (e.g., AVG, SUM) don’t fail or misreport |
| Safe numeric casting | `CASE WHEN col ~ regex THEN CAST(col AS NUMERIC)` | Converts only valid entries to numeric | Prevents casting errors during data type conversions |
| Trim and clean text fields | `TRIM()` | Removes extra spaces from names or city fields | Useful for manual-entry data or multi-system merges |
| Remove unwanted characters | `REGEXP_REPLACE(col, '[^A-Za-z\\s]', '', 'g')` | Cleans text of digits, punctuation, and symbols | Standardizes strings for grouping, joining, or machine learning models |
| Normalize text casing | `LOWER()`, `UPPER()`, `INITCAP()` | Converts inconsistent casing to standard formats | Ensures accurate grouping and consistent visual presentation |
| Clean ZIP and FIPS codes | `REGEXP_REPLACE()` + `LPAD()` | Removes non-numeric characters and pads missing zeros | Prepares identifiers for geographic joins and validation |
| Handle blank or missing categorical fields | `COALESCE(NULLIF(TRIM(col), ''), 'UNKNOWN')` | Converts blank text and NULLs to clear labels | Makes categorical variables explicit for dashboards and reporting |
| Detect inconsistent category variants | `SELECT DISTINCT LOWER(col)` | Finds duplicated categories differing only by case | Helps consolidate values before aggregation |
| Preview before vs. after cleaning | Raw vs. cleaned comparison query | Shows immediate data improvements | Ideal for portfolio documentation and QA verification |
Key Takeaways
- Data type consistency ensures that every column supports its intended operation (numeric → arithmetic, text → joins).
- Regex-based cleaning enables flexible and repeatable correction of messy or user-entered data.
- Case and category normalization eliminate hidden duplicates, improving grouping accuracy.
- Padding identifiers keeps cross-referenced codes (like ZIP or FIPS) aligned with external sources.
- Blank-field handling with COALESCE() ensures clarity between “missing” and “not applicable.”
- SQL-first cleaning supports reproducibility and scalability — ideal for enterprise pipelines and analytical environments.
Date & Time Basics (conversion, extraction, validation)
Dates and times are among the most critical yet error-prone data types in analytics.
Whether you’re working with patient visits, transactions, or time-series indicators, proper handling of temporal data ensures analytical accuracy and chronological integrity.
This section demonstrates key SQL techniques for date conversion, validation, and temporal analysis using the visit_hospital dataset.
Inspecting and Profiling Date Columns
The dataset includes birthdate, deathdate, and a visit number field (visit).
Let’s quickly inspect their contents and verify their types.
-- Review the date columns to understand their current formats
SELECT
id_text,
birthdate,
deathdate,
visit
FROM public.visit_hospital
LIMIT 10;
| id_text | birthdate | deathdate | visit |
|---|---|---|---|
| Hatt|Nade|1986-07-04 | NA | NA | 1 |
| Cars|Litt|2012-09-02 | NA | NA | 1 |
| Anit|Sánc|1960-05-07 | NA | NA | 1 |
| Doll|Fran|2008-10-04 | NA | NA | 1 |
| Cars|Litt|2012-09-02 | NA | NA | 2 |
| Cars|Litt|2012-09-02 | NA | NA | 3 |
| Anit|Sánc|1960-05-07 | NA | NA | 3 |
| Doll|Fran|2008-10-04 | NA | NA | 3 |
| Hatt|Nade|1986-07-04 | NA | NA | 4 |
| Cars|Litt|2012-09-02 | NA | NA | 4 |
This initial inspection gives a quick sense of what we’re working with before any conversion or validation.
Creating Raw Text Versions of Date Fields
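The profiling queries that follow reference a birthdate_raw text column that is not created in the code shown so far. A minimal sketch of how such raw copies could be added (the mixed-format values seen in later outputs would be seeded by a separate step, analogous to the messy-value injection used for income_raw):

```sql
-- Keep raw text copies of the date columns so the originals stay untouched
ALTER TABLE public.visit_hospital
ADD COLUMN IF NOT EXISTS birthdate_raw TEXT,
ADD COLUMN IF NOT EXISTS deathdate_raw TEXT;

UPDATE public.visit_hospital
SET birthdate_raw = birthdate::text,
    deathdate_raw = deathdate::text
WHERE birthdate_raw IS NULL
  AND deathdate_raw IS NULL;
```

Working on text copies lets us profile and repair freely while preserving the source columns for comparison.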
Detect Invalid or Impossible Dates
A common quality check involves identifying logically impossible dates, such as birthdates in the future or deathdates earlier than birthdates.
-- Classify raw dates by validity and pattern
WITH parts AS (
SELECT
id_text,
birthdate_raw,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 1)::int
END AS year,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 2)::int
END AS month,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 3)::int
END AS day
FROM public.visit_hospital
),
classified AS (
SELECT
id_text,
birthdate_raw,
CASE
WHEN year IS NOT NULL THEN
CASE
WHEN month BETWEEN 1 AND 12
AND day BETWEEN 1 AND 31
THEN 'Valid YYYY-MM-DD'
ELSE 'Invalid YYYY-MM-DD'
END
ELSE 'Other/Unknown Format'
END AS pattern
FROM parts
)
SELECT pattern, COUNT(*) AS record_count
FROM classified
GROUP BY pattern
ORDER BY record_count DESC;
| pattern | record_count |
|---|---|
| Other/Unknown Format | 32 |
| Valid YYYY-MM-DD | 4 |
| Invalid YYYY-MM-DD | 4 |
This profiling identifies all structural variations. Typical outputs include YYYY-MM-DD, YYYY/MM/DD, DD-MM-YYYY, ISO with time, and residual invalid or sentinel categories. Such analysis helps define your cleaning strategy precisely.
Profiling Valid vs. Invalid Date Patterns
We first split potential YYYY-MM-DD entries into numeric components and classify them as valid, invalid, or unknown/other based on logical rules.
split_part() extracts year, month, and day components.
The CASE logic validates ranges without immediately casting — avoiding date/time field out of range errors.
This provides a count of valid vs. invalid entries before proceeding.
Previewing Invalid and Unrecognized Formats
To inspect the actual records in those categories, filter directly on the pattern column.
-- Inspect rows with invalid or unknown date formats
WITH parts AS (
SELECT
id_text,
birthdate_raw,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 1)::int
END AS year,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 2)::int
END AS month,
CASE
WHEN birthdate_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(birthdate_raw, '-', 3)::int
END AS day
FROM public.visit_hospital
),
classified AS (
SELECT
id_text,
birthdate_raw,
CASE
WHEN year IS NOT NULL THEN
CASE
WHEN month BETWEEN 1 AND 12
AND day BETWEEN 1 AND 31
THEN 'Valid YYYY-MM-DD'
ELSE 'Invalid YYYY-MM-DD'
END
ELSE 'Other/Unknown Format'
END AS pattern
FROM parts
)
SELECT id_text, birthdate_raw, pattern
FROM classified
WHERE pattern IN ('Invalid YYYY-MM-DD', 'Other/Unknown Format')
ORDER BY pattern, id_text
LIMIT 20;
| id_text | birthdate_raw | pattern |
|---|---|---|
| Home|Quiñ|1955-10-26 | 1984-13-24 | Invalid YYYY-MM-DD |
| Home|Quiñ|1955-10-26 | 1984-13-24 | Invalid YYYY-MM-DD |
| Home|Quiñ|1955-10-26 | 1984-13-24 | Invalid YYYY-MM-DD |
| Home|Quiñ|1955-10-26 | 1984-13-24 | Invalid YYYY-MM-DD |
| Anit|Sánc|1960-05-07 | NA | Other/Unknown Format |
| Anit|Sánc|1960-05-07 | NA | Other/Unknown Format |
| Anit|Sánc|1960-05-07 | NA | Other/Unknown Format |
| Anit|Sánc|1960-05-07 | NA | Other/Unknown Format |
| Cars|Litt|2012-09-02 | 1986/07/04 | Other/Unknown Format |
| Cars|Litt|2012-09-02 | 1986/07/04 | Other/Unknown Format |
This step lists actual problematic records — the type of visual audit analysts perform before applying corrections. It allows you to verify whether errors are structural (e.g., 1984-13-24) or represent other issues such as placeholders (NA, blanks, or unrecognized symbols).
Repairing and Formatting Invalid Dates
Invalid entries can be normalized through conditional parsing, string manipulation, or replacement rules. The example below demonstrates three repair techniques:
Convert slashed or dashed formats consistently.
Replace sentinel values (NA, #NULL!) with NULL.
Remove invalid or non-parsable text safely.
-- Clean and standardize textual date representations
SELECT
id_text,
birthdate_raw,
CASE
WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN NULL
WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$' THEN REPLACE(birthdate_raw, '/', '-') -- unify separator
WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$' THEN
TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD') -- convert to ISO
WHEN birthdate_raw ~ '^\s*\d{4}-\d{2}-\d{2}\s*$' THEN TRIM(birthdate_raw) -- already clean
ELSE NULL
END AS birthdate_formatted
FROM public.visit_hospital
LIMIT 20;
| id_text | birthdate_raw | birthdate_formatted |
|---|---|---|
| Hatt|Nade|1986-07-04 | NA | NA |
| Cars|Litt|2012-09-02 | 1986/07/04 | 1986-07-04 |
| Anit|Sánc|1960-05-07 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Cars|Litt|2012-09-02 | 1986/07/04 | 1986-07-04 |
| Cars|Litt|2012-09-02 | 1986/07/04 | 1986-07-04 |
| Anit|Sánc|1960-05-07 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Dwig|Hill|1984-08-24 | NA | NA |
| Tamb|Bali|1953-05-10 | NA | NA |
This step converts all compatible formats into the standard YYYY-MM-DD.
Invalid, unrecognized, or sentinel entries are replaced with NULL.
Using TO_CHAR() ensures consistent ISO formatting for presentation or export.
Next, we extend the date-cleaning logic to ensure that all conversions are type-safe, format-consistent, and free of range errors.
By validating both the structure and value ranges of date components before conversion, we can confidently transform diverse text-based inputs into clean, usable DATE fields.
-- Safely convert and validate mixed-format date strings
WITH cleaned AS (
SELECT
id_text,
CASE
-- Handle missing or placeholder values
WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN NULL
-- Reformat YYYY/MM/DD → YYYY-MM-DD
WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$'
THEN REPLACE(birthdate_raw, '/', '-')
-- Convert DD-MM-YYYY → YYYY-MM-DD
WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$'
THEN TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD')
-- Keep YYYY-MM-DD only if month/day are in valid ranges
WHEN birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$'
THEN TRIM(birthdate_raw)
ELSE NULL
END AS birthdate_clean_text
FROM public.visit_hospital
)
SELECT
id_text,
birthdate_clean_text,
CASE
-- Only cast strings that pass full structural and range validation
WHEN birthdate_clean_text ~ '^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
THEN TO_DATE(birthdate_clean_text, 'YYYY-MM-DD')
ELSE NULL
END AS birthdate_final
FROM cleaned
WHERE birthdate_clean_text IS NOT NULL
LIMIT 15;
| id_text | birthdate_clean_text | birthdate_final |
|---|---|---|
| Cars|Litt|2012-09-02 | 1986-07-04 | NA |
| Cars|Litt|2012-09-02 | 1986-07-04 | NA |
| Cars|Litt|2012-09-02 | 1986-07-04 | NA |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
- The inner CASE logic prepares a standardized YYYY-MM-DD text representation.
- Regex guards ensure only syntactically and semantically valid month/day values are converted.
- Invalid, blank, or sentinel entries are safely excluded.
- The outer CASE performs the final cast to the true PostgreSQL DATE type.

This two-step validation (first reformat, then convert) prevents most 22008: date/time field value out of range errors, even on messy or mixed-format datasets. Note that the regex range guard alone still admits calendar-impossible days such as February 30; the month-length check shown later in this section catches those.
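As a complementary check, PostgreSQL 16 and newer ship the built-in pg_input_is_valid() function, which tests whether a string would cast cleanly to a given type without raising an error. A hedged sketch, assuming the same visit_hospital table and PostgreSQL 16+:

```sql
-- Sketch (PostgreSQL 16+): list raw strings that would NOT cast cleanly to DATE.
-- pg_input_is_valid() returns false instead of raising an error on bad input.
SELECT id_text, birthdate_raw
FROM public.visit_hospital
WHERE birthdate_raw NOT IN ('NA', '#NULL!', '')
  AND NOT pg_input_is_valid(TRIM(birthdate_raw), 'date');
```

On older PostgreSQL versions, the regex-guarded CASE approach above remains the portable option.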
Preview of Successfully Cleaned Dates
-- Preview cleaned and validated dates
WITH cleaned AS (
SELECT
id_text,
CASE
WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN NULL
WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$' THEN REPLACE(birthdate_raw, '/', '-')
WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$' THEN TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD')
WHEN birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$' THEN TRIM(birthdate_raw)
ELSE NULL
END AS birthdate_clean_text
FROM public.visit_hospital
)
SELECT
id_text,
birthdate_clean_text,
TO_DATE(birthdate_clean_text, 'YYYY-MM-DD') AS birthdate_final
FROM cleaned
WHERE birthdate_clean_text IS NOT NULL
ORDER BY id_text
LIMIT 20;

| id_text | birthdate_clean_text | birthdate_final |
|---|---|---|
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Aure|Weis|2003-11-12 | 2003-11-12 | 2003-11-12 |
| Cars|Litt|2012-09-02 | 1986-07-04 | 1986-07-04 |
| Cars|Litt|2012-09-02 | 1986-07-04 | 1986-07-04 |
| Cars|Litt|2012-09-02 | 1986-07-04 | 1986-07-04 |
| Cars|Litt|2012-09-02 | 1986-07-04 | 1986-07-04 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
| Mari|Lind|1966-06-23 | 1986-04-07 | 1986-04-07 |
This preview confirms that only valid and correctly formatted entries remain. All date strings are now standardized in ISO format (YYYY-MM-DD) and safely castable into the DATE type for further analysis.
Detecting Residual Missing or Invalid Entries
It’s useful to summarize how many rows were successfully cleaned versus how many remain missing or invalid. This summary provides a data-quality snapshot — showing the proportion of records that were valid, converted, or excluded.
-- Summarize validity of birthdate entries after cleaning
WITH final_status AS (
SELECT
CASE
WHEN birthdate_raw IN ('NA', '#NULL!', '') THEN 'Missing / Placeholder'
WHEN birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$' THEN 'Valid (YYYY-MM-DD)'
WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$' THEN 'Converted (DD-MM-YYYY)'
WHEN birthdate_raw ~ '^\s*\d{4}/\d{2}/\d{2}\s*$' THEN 'Converted (YYYY/MM/DD)'
ELSE 'Other / Invalid'
END AS category
FROM public.visit_hospital
)
SELECT category, COUNT(*) AS record_count
FROM final_status
GROUP BY category
ORDER BY record_count DESC;

| category | record_count |
|---|---|
| Missing / Placeholder | 20 |
| Other / Invalid | 8 |
| Converted (YYYY/MM/DD) | 4 |
| Valid (YYYY-MM-DD) | 4 |
| Converted (DD-MM-YYYY) | 4 |
Listing the Invalid and Unknown Rows
WITH normalized AS (
SELECT
id_text,
birthdate_raw,
NULLIF(TRIM(birthdate_raw), '') AS raw_trim,
CASE
WHEN birthdate_raw ~ '^\s*\d{4}[-/]\d{2}[-/]\d{2}\s*$'
THEN REGEXP_REPLACE(TRIM(birthdate_raw), '/', '-', 'g')
WHEN birthdate_raw ~ '^\s*\d{2}-\d{2}-\d{4}\s*$'
THEN TO_CHAR(TO_DATE(TRIM(birthdate_raw), 'DD-MM-YYYY'), 'YYYY-MM-DD')
ELSE NULL
END AS ymd_text
FROM public.visit_hospital
),
parts AS (
SELECT
id_text, birthdate_raw, raw_trim, ymd_text,
CASE WHEN ymd_text ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(ymd_text,'-',1)::int END AS y,
CASE WHEN ymd_text ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(ymd_text,'-',2)::int END AS m,
CASE WHEN ymd_text ~ '^\d{4}-\d{2}-\d{2}$' THEN split_part(ymd_text,'-',3)::int END AS d
FROM normalized
),
validated AS (
SELECT
id_text, birthdate_raw, ymd_text, y, m, d,
(m BETWEEN 1 AND 12)
AND (d BETWEEN 1 AND EXTRACT(DAY FROM (date_trunc('month', make_date(y,m,1))
+ INTERVAL '1 month' - INTERVAL '1 day'))) AS is_valid
FROM parts
)
SELECT
id_text,
birthdate_raw,
ymd_text AS normalized_text,
y, m, d,
CASE
WHEN ymd_text IS NOT NULL AND NOT is_valid THEN 'Invalid (impossible date)'
WHEN ymd_text IS NULL THEN 'Unknown/Unparsed'
END AS reason
FROM validated
WHERE (ymd_text IS NOT NULL AND NOT is_valid)
   OR (ymd_text IS NULL
       -- exclude explicit placeholders:
       AND COALESCE(TRIM(birthdate_raw), '') NOT IN ('NA', '#NULL!', ''))
ORDER BY reason, id_text
LIMIT 20;

| id_text | birthdate_raw | normalized_text | y | m | d | reason |
|---|---|---|---|---|---|---|
| Home|Quiñ|1955-10-26 | 1984-13-24 | 1984-13-24 | 1984 | 13 | 24 | Invalid (impossible date) |
| Home|Quiñ|1955-10-26 | 1984-13-24 | 1984-13-24 | 1984 | 13 | 24 | Invalid (impossible date) |
| Home|Quiñ|1955-10-26 | 1984-13-24 | 1984-13-24 | 1984 | 13 | 24 | Invalid (impossible date) |
| Home|Quiñ|1955-10-26 | 1984-13-24 | 1984-13-24 | 1984 | 13 | 24 | Invalid (impossible date) |
Range Profiling of Cleaned Dates
After cleaning and conversion to valid dates, you can check for chronological plausibility (e.g., no future birthdates).
-- Profile chronological range of cleaned birthdates
WITH validated AS (
SELECT
TO_DATE(TRIM(birthdate_raw), 'YYYY-MM-DD') AS birthdate_final
FROM public.visit_hospital
WHERE birthdate_raw ~ '^\s*\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])\s*$'
)
SELECT
MIN(birthdate_final) AS earliest_birth,
MAX(birthdate_final) AS latest_birth,
COUNT(*) AS total_valid
FROM validated;

| earliest_birth | latest_birth | total_valid |
|---|---|---|
| 2003-11-12 | 2003-11-12 | 4 |
Summary of Date Formatting
| Step | Technique | Purpose |
|---|---|---|
| Pattern profiling | ~ (regex match) + split_part() | Identify valid, invalid, or mixed-format entries |
| Safe reformatting | REPLACE() + TO_CHAR(TO_DATE(...)) | Convert raw input to ISO YYYY-MM-DD text safely |
| Range validation | Regex guard (0[1-9]\|1[0-2]) for month, etc. | Block out-of-range month/day before conversion |
| Safe conversion | CASE WHEN ... THEN TO_DATE() | Cast only confirmed valid formats to DATE |
| Quality summary | GROUP BY category | Provide audit view of data-quality categories |
| Range check | MIN() / MAX() | Confirm chronological plausibility of cleaned dates |
Date & Time Analysis
Profiling the Date of Birth Column
Start with an overview to check completeness and plausible ranges.
-- Profile birthdate data: completeness and range
SELECT
COUNT(*) AS total_records,
COUNT(birthdate) AS non_missing_birthdate,
MIN(birthdate) AS earliest_birth,
MAX(birthdate) AS latest_birth
FROM public.visit_hospital;

| total_records | non_missing_birthdate | earliest_birth | latest_birth |
|---|---|---|---|
| 40 | 24 | 1953-05-10 | 2005-04-25 |
This confirms the time coverage of your data (here, birthdates from 1953 to 2005) and helps detect potential outliers or missing values.
Calculating Births by Month and Year
-- Count births by month
SELECT
DATE_TRUNC('month', birthdate)::date AS birth_month,
COUNT(*) AS total_births
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY 1
ORDER BY birth_month;

| birth_month | total_births |
|---|---|
| 1953-05-01 | 4 |
| 1955-10-01 | 4 |
| 1966-06-01 | 4 |
| 1984-08-01 | 4 |
| 2003-11-01 | 4 |
| 2005-04-01 | 4 |
-- Births by Year (show only the year number)
SELECT
EXTRACT(YEAR FROM birthdate)::int AS birth_year,
COUNT(*) AS total_births
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY 1
ORDER BY birth_year;

| birth_year | total_births |
|---|---|
| 1953 | 4 |
| 1955 | 4 |
| 1966 | 4 |
| 1984 | 4 |
| 2003 | 4 |
| 2005 | 4 |
Calculating Current Age in Years
The AGE() function calculates an interval between two dates, while DATE_PART() extracts the number of years.
-- Compute current age in years for each individual
SELECT
id_text,
birthdate,
DATE_PART('year', AGE(CURRENT_DATE, birthdate)) AS age_years
FROM public.visit_hospital
ORDER BY age_years DESC
LIMIT 15;

| id_text | birthdate | age_years |
|---|---|---|
| Hatt|Nade|1986-07-04 | NA | NA |
| Anit|Sánc|1960-05-07 | NA | NA |
| Cars|Litt|2012-09-02 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Hatt|Nade|1986-07-04 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Doll|Fran|2008-10-04 | NA | NA |
| Cars|Litt|2012-09-02 | NA | NA |
| Hatt|Nade|1986-07-04 | NA | NA |
AGE(CURRENT_DATE, birthdate) returns the elapsed interval since birth, and DATE_PART('year', …) extracts the number of full years. Note that rows with missing birthdates appear first in the output above because PostgreSQL sorts NULLs first under ORDER BY … DESC; append NULLS LAST, or filter with WHERE birthdate IS NOT NULL, to push them to the end.
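The distinction between completed years and a naive year difference is easy to see side by side; a minimal sketch on the same table:

```sql
-- Sketch: AGE() counts completed years; subtracting year numbers ignores
-- whether the birthday has already occurred this year.
SELECT
  birthdate,
  DATE_PART('year', AGE(CURRENT_DATE, birthdate))                AS completed_years,
  EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM birthdate) AS naive_year_diff
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
LIMIT 5;
```

For anyone born late in the year, naive_year_diff will read one year higher than completed_years until their birthday passes.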
Calculating Duration from Birthdate to a Reference Date
Interval arithmetic is often used to determine time to a fixed milestone (e.g., 18th or 65th birthday).
-- Time until 65th birthday for each person
SELECT
id_text,
birthdate,
(birthdate + INTERVAL '65 years') AS date_turns_65,
AGE((birthdate + INTERVAL '65 years'), CURRENT_DATE) AS time_until_65
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
ORDER BY date_turns_65
LIMIT 15;

| id_text | birthdate | date_turns_65 | time_until_65 |
|---|---|---|---|
| Tamb|Bali|1953-05-10 | 1953-05-10 | 2018-05-10 | -7 years -6 mons -5 days |
| Tamb|Bali|1953-05-10 | 1953-05-10 | 2018-05-10 | -7 years -6 mons -5 days |
| Tamb|Bali|1953-05-10 | 1953-05-10 | 2018-05-10 | -7 years -6 mons -5 days |
| Tamb|Bali|1953-05-10 | 1953-05-10 | 2018-05-10 | -7 years -6 mons -5 days |
| Home|Quiñ|1955-10-26 | 1955-10-26 | 2020-10-26 | -5 years -20 days |
| Home|Quiñ|1955-10-26 | 1955-10-26 | 2020-10-26 | -5 years -20 days |
| Home|Quiñ|1955-10-26 | 1955-10-26 | 2020-10-26 | -5 years -20 days |
| Home|Quiñ|1955-10-26 | 1955-10-26 | 2020-10-26 | -5 years -20 days |
| Mari|Lind|1966-06-23 | 1966-06-23 | 2031-06-23 | 5 years 7 mons 8 days |
| Mari|Lind|1966-06-23 | 1966-06-23 | 2031-06-23 | 5 years 7 mons 8 days |
By adding intervals to birthdate, you can project future milestones, calculate eligibility periods, or align with policy cutoffs (e.g., retirement age).
Detect Implausible or Anomalous Ages
Identifying outliers helps ensure that demographic insights are grounded in reality.
WITH ages AS (
SELECT
id_text,
DATE_PART('year', AGE(CURRENT_DATE, birthdate)) AS age_years
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
)
SELECT
COUNT(*) FILTER (WHERE age_years < 0) AS negative_age,
COUNT(*) FILTER (WHERE age_years > 100) AS unrealistic_age,
ROUND(AVG(age_years)::numeric, 1) AS avg_age
FROM ages;

| negative_age | unrealistic_age | avg_age |
|---|---|---|
| 0 | 0 | 47.3 |
This check ensures no impossible ages (negative or over 100 years) are present. Such records may indicate data-entry or unit conversion issues.
Temporal Trends by Birth Year or Decade
Aggregating by birth year or decade reveals how age distributions evolve over time.
-- Group individuals by birth year and decade
SELECT
FLOOR(EXTRACT(YEAR FROM birthdate) / 10) * 10 AS birth_decade,
ROUND(CAST(AVG(DATE_PART('year', AGE(CURRENT_DATE, birthdate))) AS NUMERIC), 1) AS avg_age,
COUNT(*) AS total_individuals
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY birth_decade
ORDER BY birth_decade;

| birth_decade | avg_age | total_individuals |
|---|---|---|
| 1950 | 71 | 8 |
| 1960 | 59 | 4 |
| 1980 | 41 | 4 |
| 2000 | 21 | 8 |
EXTRACT(YEAR …) pulls the numeric birth year.
Decade grouping (FLOOR(year / 10) * 10) is helpful for longitudinal or generational reporting.
Visualizing Age Distribution Over Time
-- Average current age by birth decade
SELECT
FLOOR(EXTRACT(YEAR FROM birthdate) / 10) * 10 AS birth_decade,
ROUND(CAST(AVG(DATE_PART('year', AGE(CURRENT_DATE, birthdate))) AS NUMERIC), 1) AS avg_age,
COUNT(*) AS total_individuals
FROM public.visit_hospital
WHERE birthdate IS NOT NULL
GROUP BY birth_decade
ORDER BY birth_decade;

| birth_decade | avg_age | total_individuals |
|---|---|---|
| 1950 | 71 | 8 |
| 1960 | 59 | 4 |
| 1980 | 41 | 4 |
| 2000 | 21 | 8 |
This provides an age summary across decades — a compact representation of how age structure shifts over time.
Part VII — Descriptive Statistics & Tabulation
- Aggregations & GROUP BY; HAVING filters
- Ordered-set aggregates (percentiles)
- Crosstab/pivot with the tablefunc extension
Once the data is clean and structured, the next step is summarization and tabulation — transforming granular observations into interpretable statistics.
This section demonstrates descriptive techniques such as aggregation with GROUP BY, conditional filtering with HAVING, calculating percentiles, and reshaping summaries into cross-tabulated (pivoted) formats.
All examples below use the public.global_health_statistics dataset.
1️⃣ Aggregation with GROUP BY
Aggregations condense large datasets into concise summaries — totals, averages, and counts by meaningful categories. The GROUP BY clause in SQL is the equivalent of group_by() + summarise() in R.
In data analytics, this is how we turn millions of records into tables that inform decision-making.
In the example below, the query groups data by both country and disease, and computes three key metrics:
- The average number of people affected per record,
- The total number of people affected (overall disease burden),
- The number of records contributing to each group. The example below shows how
GROUP BYcan be used to generate aggregated insights across multiple dimensions, revealing where disease burdens are highest.
-- Average population affected by disease and country
SELECT
country,
disease_name,
ROUND(AVG(population_affected)::numeric, 0) AS avg_population_affected,
SUM(population_affected) AS total_population_affected,
COUNT(*) AS records_count
FROM public.global_health_statistics
GROUP BY country, disease_name
ORDER BY total_population_affected DESC
LIMIT 15;

| country | disease_name | avg_population_affected | total_population_affected | records_count |
|---|---|---|---|---|
| Germany | Tuberculosis | 994653 | 24866314 | 25 |
| Mexico | Ebola | 994470 | 24861757 | 25 |
| Turkey | Dengue | 994385 | 24859631 | 25 |
| Argentina | Ebola | 994191 | 24854765 | 25 |
| France | Hepatitis | 994065 | 24851614 | 25 |
| Argentina | Diabetes | 993911 | 24847772 | 25 |
| Mexico | Rabies | 993845 | 24846123 | 25 |
| South Korea | Parkinson’s Disease | 993740 | 24843491 | 25 |
| Turkey | HIV/AIDS | 993719 | 24842966 | 25 |
| Italy | Polio | 993676 | 24841908 | 25 |
Context: This type of aggregation could support a quick ranking of health priorities — for instance, identifying which diseases have historically affected the largest populations within each country. Using both AVG() and SUM() together allows you to distinguish between typical impact per observation and overall cumulative impact. The ORDER BY ensures that the most significant results (in terms of total affected population) appear at the top.
2️⃣ HAVING for Filtering Aggregated Results
After summarizing, it’s often useful to narrow down results to only those groups that meet specific conditions — for example, diseases with exceptionally high case counts. Unlike WHERE, which filters individual rows before aggregation, HAVING filters aggregated groups after summary calculations.
The example below shows how to isolate high-impact diseases affecting over one million people in total.
-- Show only diseases with more than 1 million people affected
SELECT
disease_name,
SUM(population_affected) AS total_population
FROM public.global_health_statistics
GROUP BY disease_name
HAVING SUM(population_affected) > 1000000
ORDER BY total_population DESC
LIMIT 10;

| disease_name | total_population |
|---|---|
| Polio | 495323206 |
| Tuberculosis | 495320883 |
| Parkinson’s Disease | 495304287 |
| Ebola | 495254480 |
| Hypertension | 495155070 |
| Cholera | 495153078 |
| Alzheimer’s Disease | 495140228 |
| Zika | 495121139 |
| COVID-19 | 495111737 |
| Cancer | 495100583 |
Here, HAVING ensures we only report diseases where the cumulative affected population exceeds 1 million. Context: This query is useful when performing threshold-based analysis — such as identifying key health challenges that exceed certain severity levels. It could easily be adapted for different analytical questions, such as finding diseases with a rapid increase over time or regions exceeding target incidence rates.
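To make the row-level versus group-level distinction concrete, a minimal sketch combining both clauses (column names as in the examples above; the 2010 cutoff is illustrative):

```sql
-- Sketch: WHERE trims rows before aggregation; HAVING trims groups after it.
SELECT disease_name, SUM(population_affected) AS total_population
FROM public.global_health_statistics
WHERE year_event >= 2010                     -- row filter, applied first
GROUP BY disease_name
HAVING SUM(population_affected) > 1000000    -- group filter, applied to the sums
ORDER BY total_population DESC;
```

Swapping the two clauses is a classic error: referencing an aggregate like SUM() in WHERE fails, because WHERE runs before the groups exist.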
3️⃣ Ordered-Set Aggregates: Percentiles & Medians
While aggregates like AVG() provide a measure of central tendency, percentiles and medians describe distribution and spread. They are especially useful when data contains outliers — for example, one extreme outbreak could distort the average, while the median gives a more typical view.
The example below shows how to compute quartiles (25th, 50th, 75th percentiles) of population affected for each disease.
-- Compute median and quartiles of affected population by disease
SELECT
disease_name,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY population_affected) AS q1,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY population_affected) AS median,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY population_affected) AS q3
FROM public.global_health_statistics
WHERE population_affected IS NOT NULL
GROUP BY disease_name
ORDER BY median DESC
LIMIT 10;

| disease_name | q1 | median | q3 |
|---|---|---|---|
| Zika | 986952.5 | 993871.0 | 997424.8 |
| COVID-19 | 986470.8 | 993712.0 | 996905.0 |
| Parkinson’s Disease | 986463.0 | 993632.5 | 997368.5 |
| Cholera | 987292.5 | 993503.5 | 997038.0 |
| Polio | 987037.5 | 993333.5 | 996737.0 |
| Measles | 985969.0 | 993319.5 | 997046.5 |
| Leprosy | 985762.0 | 993277.5 | 997048.2 |
| Diabetes | 987171.0 | 993161.0 | 997301.8 |
| Tuberculosis | 987507.0 | 993129.5 | 997328.0 |
| Ebola | 986789.0 | 993120.5 | 997368.8 |
Context: Percentiles help identify which diseases have the widest or most concentrated spread of population impact. For instance, a disease with a large gap between the 25th and 75th percentile likely affects different regions very unevenly — an insight valuable for targeted interventions.
4️⃣ Reshaping Data: Cross-Tabulation (Pivoting Data)
Data reshaping refers to transforming the structure of a dataset — switching between long and wide formats depending on analytical or reporting needs.
In long format, each observation occupies one row (e.g., each record represents a country–year–disease combination).
In wide format, a single row may represent a country or year, and multiple variables or categories become columns (e.g., diseases side by side).
Reshaping is a key part of preparing data for visualizations, dashboards, and comparative summaries.
While SQL is inherently row-oriented, PostgreSQL’s tablefunc extension and some creative joins make reshaping feasible directly within SQL.
These transformations are common in epidemiology, public health reporting, and data science workflows:
- Analysts prefer long format for modeling and statistical analysis.
- Decision-makers and stakeholders often prefer wide format for dashboards, comparisons, and quick scanning.
The following examples demonstrate both wide and long reshaping operations using the global_health_statistics dataset.
4.1: Reshaping to Wide Format
Pivoting data turns long-format tables into wide-format summaries, similar to Excel pivot tables or R’s pivot_wider(). This is particularly useful for comparing values side by side — such as annual totals per disease.
In PostgreSQL, pivoting can be done using the crosstab() function from the tablefunc extension.
Why this matters: stakeholders often ask for "diseases across columns by year." The example below pivots totals by year into a wide table for quick scanning.
-- Enable the extension (run once per database)
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT *
FROM crosstab(
$$
SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
FROM public.global_health_statistics
GROUP BY year_event, disease_name
ORDER BY year_event, disease_name
$$
) AS ct(
year_event INT,
malaria NUMERIC,
dengue NUMERIC,
influenza NUMERIC,
covid19 NUMERIC
)
LIMIT 10;

| year_event | malaria | dengue | influenza | covid19 |
|---|---|---|---|---|
| 2000 | 19835840 | 19806950 | 19801964 | 19865735 |
| 2001 | 19794559 | 19817181 | 19824040 | 19822067 |
| 2002 | 19836984 | 19816082 | 19759866 | 19784016 |
| 2003 | 19773534 | 19704438 | 19764751 | 19803744 |
| 2004 | 19792759 | 19825219 | 19778123 | 19787008 |
| 2005 | 19850519 | 19787102 | 19870901 | 19746345 |
| 2006 | 19846542 | 19769208 | 19814068 | 19747205 |
| 2007 | 19847607 | 19821630 | 19766503 | 19760524 |
| 2008 | 19735264 | 19848897 | 19823537 | 19756148 |
| 2009 | 19718827 | 19763020 | 19770930 | 19802159 |
Context: This example shows how a long transactional dataset can be reshaped into a compact, comparative view where each disease becomes a column. Such output is ideal for visualizing trends or preparing data for dashboards (e.g., year-over-year change in disease prevalence).
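One caveat: the single-argument form of crosstab() fills the declared output columns positionally, so if the source query returns more disease categories than the four declared here, values can land under the wrong headings. A hedged sketch of the safer two-argument form, which pins each category to its column (the disease-name spellings are assumed to match the dataset):

```sql
-- Sketch: two-argument crosstab() matches categories to declared columns explicitly;
-- missing year/disease combinations become NULL instead of shifting other values.
SELECT *
FROM crosstab(
  $$ SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
     FROM public.global_health_statistics
     GROUP BY year_event, disease_name
     ORDER BY 1, 2 $$,
  $$ VALUES ('Malaria'), ('Dengue'), ('Influenza'), ('COVID-19') $$
) AS ct(
  year_event INT,
  malaria NUMERIC,
  dengue NUMERIC,
  influenza NUMERIC,
  covid19 NUMERIC
)
LIMIT 5;
```

The second query enumerates the categories in the exact order of the declared columns, which makes the pivot robust to extra or absent categories in the data.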
4.2: Reshaping to Long Format
Long format is often preferred for modeling, analytics, and machine learning pipelines because it’s tidy — every variable has its own column, and every observation has its own row. When you receive data in a wide structure (for example, one column per disease), converting it back to long format makes it easier to aggregate, filter, or compute metrics by category.
Why this matters: Analytical workflows (e.g., regression, forecasting, clustering) often expect a normalized long structure. This format aligns better with relational database principles and is easier to extend when new diseases or measures appear.
In PostgreSQL, this can be achieved with UNION ALL or the more compact UNNEST() approach.
SELECT year_event, disease_name, population_affected
FROM (
SELECT *
FROM crosstab(
$$
SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
FROM public.global_health_statistics
GROUP BY year_event, disease_name
ORDER BY year_event, disease_name
$$
) AS ct(
year_event INT,
malaria NUMERIC,
dengue NUMERIC,
influenza NUMERIC,
covid19 NUMERIC
)
) crosstab_result
CROSS JOIN LATERAL unnest(
ARRAY['malaria','dengue','influenza','covid19'],
ARRAY[malaria,dengue,influenza,covid19]
) AS t(disease_name, population_affected)
ORDER BY year_event, disease_name;

| year_event | disease_name | population_affected |
|---|---|---|
| 2000 | covid19 | 19865735 |
| 2000 | dengue | 19806950 |
| 2000 | influenza | 19801964 |
| 2000 | malaria | 19835840 |
| 2001 | covid19 | 19822067 |
| 2001 | dengue | 19817181 |
| 2001 | influenza | 19824040 |
| 2001 | malaria | 19794559 |
| 2002 | covid19 | 19784016 |
| 2002 | dengue | 19816082 |
- The first ARRAY[…] defines the disease names.
- The second ARRAY[…] holds the corresponding values from each column.
- UNNEST expands the two arrays in parallel into long rows.
The same result can be achieved with UNION ALL, which manually "unpivots" the wide data back into long form:
-- Example: manually reshaping a wide table back to long format
-- (assuming we already have a pivoted summary table)
WITH disease_wide AS (
SELECT *
FROM crosstab(
$$
SELECT year_event::INT, disease_name, SUM(population_affected)::NUMERIC
FROM public.global_health_statistics
GROUP BY year_event, disease_name
ORDER BY year_event, disease_name
$$
) AS ct(
year_event INT,
malaria NUMERIC,
dengue NUMERIC,
influenza NUMERIC,
covid19 NUMERIC
)
)
SELECT year_event, 'malaria' AS disease_name, malaria AS population_affected FROM disease_wide
UNION ALL
SELECT year_event, 'dengue', dengue FROM disease_wide
UNION ALL
SELECT year_event, 'influenza', influenza FROM disease_wide
UNION ALL
SELECT year_event, 'covid19', covid19 FROM disease_wide
ORDER BY year_event, disease_name
LIMIT 20;

| year_event | disease_name | population_affected |
|---|---|---|
| 2000 | covid19 | 19865735 |
| 2000 | dengue | 19806950 |
| 2000 | influenza | 19801964 |
| 2000 | malaria | 19835840 |
| 2001 | covid19 | 19822067 |
| 2001 | dengue | 19817181 |
| 2001 | influenza | 19824040 |
| 2001 | malaria | 19794559 |
| 2002 | covid19 | 19784016 |
| 2002 | dengue | 19816082 |
5️⃣ Multi-Level Summaries
When you want to track patterns over time, it’s helpful to summarize data at multiple hierarchical levels. The query below aggregates the total population affected by year and country, producing a panel-like view suitable for temporal trend analysis.
-- Yearly total affected population per country
SELECT
country,
year_event,
SUM(population_affected) AS total_affected
FROM public.global_health_statistics
GROUP BY country, year_event
ORDER BY country, year_event;

| country | year_event | total_affected |
|---|---|---|
| Argentina | 2000 | 19817875 |
| Argentina | 2001 | 19835072 |
| Argentina | 2002 | 19839406 |
| Argentina | 2003 | 19825531 |
| Argentina | 2004 | 19868546 |
| Argentina | 2005 | 19803811 |
| Argentina | 2006 | 19873224 |
| Argentina | 2007 | 19812365 |
| Argentina | 2008 | 19796552 |
| Argentina | 2009 | 19806823 |
Context: This example demonstrates how a simple GROUP BY over multiple dimensions can reveal longitudinal trends — such as whether a country’s total burden is increasing or decreasing over time. The resulting dataset could serve as the basis for line charts or heatmaps in further visualization steps.
🔎 Summary Table: Descriptive Statistics & Reshaping Techniques
| Technique | Purpose | SQL Example | When to Use |
|---|---|---|---|
| GROUP BY + aggregates (SUM, AVG, COUNT) | Collapse raw rows into grouped summaries | SELECT country, disease_name, SUM(population_affected) FROM ... GROUP BY country, disease_name; | Rank burdens by country/disease, create quick summaries |
| HAVING filters | Filter groups after aggregation | ... GROUP BY disease_name HAVING SUM(population_affected) > 1000000; | Identify only high-impact diseases or threshold exceedances |
| Ordered-set aggregates (PERCENTILE_CONT) | Compute medians, quartiles, or percentiles | SELECT disease_name, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY population_affected) FROM ... GROUP BY disease_name; | Understand distributions, reduce impact of outliers |
| crosstab() (wide pivot) | Reshape long data into wide format (like Excel pivot tables) | SELECT * FROM crosstab($$SELECT year_event, disease_name, SUM(population_affected) ...$$) AS ct(...); | Dashboards, year-over-year comparisons, stakeholder summaries |
| UNNEST() / UNION ALL (long unpivot) | Convert wide back to long/tidy format | SELECT year_event, disease_name, population_affected FROM wide CROSS JOIN LATERAL unnest(...); | Feed into modeling, visualization, machine learning |
| Multi-level summaries | Aggregate across multiple hierarchical dimensions | SELECT country, year_event, SUM(population_affected) FROM ... GROUP BY country, year_event; | Trend analysis across time and geography |
Key Insight:
- Wide format is great for comparative dashboards, stakeholder reporting, and scanning.
- Long format is better for tidy analysis, visualisation, and interoperability with tools like R or Python.
- SQL’s flexibility (GROUP BY, HAVING, PERCENTILE_CONT, crosstab, UNNEST) allows analysts to move fluidly between exploratory statistics and presentation-ready summaries.
Key Takeaways
Descriptive SQL transforms raw transactional data into interpretable insights through summarization and restructuring.
The ability to move fluidly between long and wide formats reflects practical analytical flexibility — catering to both technical and decision-maker audiences.
Part VIII — Performance Basics
Database performance tuning is an essential skill when working with large datasets such as public.global_health_statistics. Even seemingly simple queries can become very slow if the database needs to scan millions of rows. This section demonstrates three core performance techniques — indexing, reading query plans, and simple tuning habits — using examples from the dataset. Performance optimization in SQL typically involves:
- Creating the right indexes
- Understanding how queries are executed with EXPLAIN ANALYZE
- Applying simple tuning strategies to improve speed
1. Indexing Strategies
SQL databases use B-tree indexes by default. They work like a sorted lookup structure that makes finding rows faster.
When to use indexes:
- On filtering columns (WHERE)
- On grouping keys (GROUP BY)
- On ordering columns (ORDER BY)
- On join keys between tables
Indexes act like a book’s table of contents: instead of flipping through every page, the database can jump directly to the rows it needs. For instance:
-- Without an index: PostgreSQL will likely do a sequential scan
EXPLAIN ANALYZE
SELECT *
FROM public.global_health_statistics
WHERE country = 'Australia';

| QUERY PLAN |
|---|
| Bitmap Heap Scan on global_health_statistics (cost=8.16..1796.94 rows=500 width=168) (actual time=0.143..0.442 rows=500 loops=1) |
| Recheck Cond: (country = ‘Australia’::text) |
| Heap Blocks: exact=497 |
| -> Bitmap Index Scan on idx_ghs_country (cost=0.00..8.04 rows=500 width=0) (actual time=0.091..0.091 rows=500 loops=1) |
| Index Cond: (country = ‘Australia’::text) |
| Planning Time: 0.077 ms |
| Execution Time: 0.467 ms |
Indexes speed up queries by allowing the database to look up rows efficiently rather than scanning the entire table. (In the plan above, idx_ghs_country already existed when this document was rendered, so PostgreSQL chose a Bitmap Index Scan instead of the sequential scan you would see on a freshly loaded table.)
CREATE INDEX IF NOT EXISTS idx_ghs_country ON public.global_health_statistics (country);
-- Re-run the query
EXPLAIN ANALYZE
SELECT *
FROM public.global_health_statistics
WHERE country = 'Australia';
With the index in place, the plan switches to an index-based scan, reducing I/O dramatically. On large datasets the savings can amount to seconds or minutes; in this small example, execution time dropped from roughly 2.9 ms to 0.88 ms (see the screenshot below).
The following screenshot shows the execution time difference observed before and after adding an index:
2. Reading Query Plans with EXPLAIN ANALYZE
Indexes are only useful if the database actually uses them. That’s where EXPLAIN ANALYZE comes in. It shows the chosen plan and the actual execution time.
-- Create index on year_event
CREATE INDEX IF NOT EXISTS idx_ghs_year_event ON public.global_health_statistics (year_event);
-- Re-run the query to see the (potentially) faster plan
EXPLAIN ANALYZE
SELECT year_event, COUNT(*) AS n_records
FROM public.global_health_statistics
GROUP BY year_event
ORDER BY year_event;
The output explains whether PostgreSQL used sorting, hashing, or index scans. If grouping by year is frequent, adding an index on year may shift the plan from expensive full sorts to more efficient operations.
3. Simple Tuning Tips
Even without adding new indexes, developers can make queries run faster by following small habits:
a) Predicates: Make Them SARGable
SARGable = Search ARGument-able. Avoid wrapping the filtered column in a function or expression; doing so prevents the planner from using an index on that column.

| ❌ Non-SARGable | Good (SARGable) |
|---|---|
| WHERE ROUND(mortality_pct, 1) = 5.0 | WHERE mortality_pct = 5.0 |
| WHERE year_event + 0 = 2020 | WHERE year_event = 2020 |
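You can verify SARGability with EXPLAIN; a sketch, assuming the idx_ghs_year_event index created earlier exists:

```sql
-- Sketch: an expression on the column blocks index use
EXPLAIN SELECT * FROM public.global_health_statistics
WHERE year_event + 0 = 2020;   -- planner cannot match this to the index

EXPLAIN SELECT * FROM public.global_health_statistics
WHERE year_event = 2020;       -- bare column: idx_ghs_year_event becomes usable
```

Comparing the two plans (sequential scan versus an index or bitmap scan) makes the cost of the wrapped predicate visible.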
b) Projections: Select Only What You Need
-- Avoid pulling every column:
-- SELECT * FROM public.global_health_statistics;
SELECT country, year_event, disease_name, mortality_pct
FROM public.global_health_statistics
LIMIT 20;

| country | year_event | disease_name | mortality_pct |
|---|---|---|---|
| Nigeria | 2013 | Asthma | 9.15 |
| Turkey | 2010 | Tuberculosis | 2.75 |
| Argentina | 2006 | Ebola | 3.16 |
| Saudi Arabia | 2013 | Hypertension | 5.57 |
| Nigeria | 2004 | COVID-19 | 8.77 |
| Russia | 2012 | Parkinson’s Disease | 6.85 |
| Saudi Arabia | 2009 | Cancer | 1.58 |
| Russia | 2009 | Ebola | 4.46 |
| Nigeria | 2022 | Polio | 1.13 |
| India | 2016 | Leprosy | 8.48 |
c) Ordering: Use Index Support
If you frequently order by columns like population_affected, consider an index.
CREATE INDEX IF NOT EXISTS idx_population_affected
ON public.global_health_statistics (population_affected);
SELECT country, population_affected
FROM public.global_health_statistics
ORDER BY population_affected DESC
LIMIT 15;
Conclusion
With even these basic optimizations, you can significantly improve query performance:
Indexing frequently used filter, grouping, and ordering columns (year_event, country, disease_name, population_affected).
Reading query plans with EXPLAIN ANALYZE to understand execution.
Tuning queries by making predicates SARGable, projecting only needed columns, and leveraging indexes for ordering.
These practices ensure your queries on global_health_statistics remain efficient even as the dataset grows.
Part IX — Security, Backup & Sharing
Ensuring data integrity and accessibility means thinking beyond querying. This section covers how to:
- Secure access with roles & privileges
- Create backups and restore databases
- Export and share query results for reporting
1. Roles & Privileges — Granting Read-Only Access to Learners
PostgreSQL supports fine-grained control over who can connect to databases and what they can do inside.
Give Learners Read-Only Access to Global Health Data
-- Identify the database name
SELECT datname FROM pg_database;
A. Create the Role for the Learner
DO $$
BEGIN
IF NOT EXISTS (
SELECT FROM pg_roles WHERE rolname = 'learner_user'
) THEN
CREATE ROLE learner_user WITH LOGIN PASSWORD 'SafePass123';
END IF;
END $$;

B. Grant Database Access
-- Allow learner_user to connect to the specific database eg SQL_Data_Manipulation database
GRANT CONNECT ON DATABASE "SQL_Data_Manipulation" TO learner_user;

C. Grant Schema-Level Access
-- Allow the user to access objects in the 'public' schema
GRANT USAGE ON SCHEMA public TO learner_user;

D. Grant Read Access to Specific Tables
-- Allow SELECT queries only
GRANT SELECT ON public.global_health_statistics TO learner_user;

Now, learner_user can log in, browse the schema, and query the dataset — but cannot modify anything.
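A quick sanity check, assuming a local server, is to connect as the new role and try both a read and a write:

```sql
-- From a shell: psql -U learner_user -d SQL_Data_Manipulation
-- A read succeeds:
SELECT COUNT(*) FROM public.global_health_statistics;

-- A write is rejected with "permission denied for table
-- global_health_statistics":
UPDATE public.global_health_statistics SET mortality_pct = 0;
```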
Create Multiple Learners with a Group Role
Instead of managing access per individual, create a shared group role and assign learners to it.
-- Create the role group
DO $$
BEGIN
IF NOT EXISTS (
SELECT FROM pg_roles WHERE rolname = 'learners_group'
) THEN
CREATE ROLE learners_group;
END IF;
END $$;

-- Create individual users
DO $$
BEGIN
IF NOT EXISTS (
SELECT FROM pg_roles WHERE rolname = 'learner_ali'
) THEN
CREATE ROLE learner_ali WITH LOGIN PASSWORD 'Ali123';
END IF;
END $$;

DO $$
BEGIN
IF NOT EXISTS (
SELECT FROM pg_roles WHERE rolname = 'learner_sara'
) THEN
CREATE ROLE learner_sara WITH LOGIN PASSWORD 'Sara123';
END IF;
END $$;

-- Add both learners to the group
GRANT learners_group TO learner_ali;
GRANT learners_group TO learner_sara;

-- Grant group access to the database
GRANT CONNECT ON DATABASE "SQL_Data_Manipulation" TO learners_group;
GRANT USAGE ON SCHEMA public TO learners_group;
GRANT SELECT ON public.global_health_statistics TO learners_group;

Audit Who Has Access
SELECT rolname
FROM pg_roles
WHERE has_database_privilege(rolname, 'SQL_Data_Manipulation', 'CONNECT');

| rolname |
|---|
| pg_database_owner |
| pg_read_all_data |
| pg_write_all_data |
| pg_monitor |
| pg_read_all_settings |
| pg_read_all_stats |
| pg_stat_scan_tables |
| pg_read_server_files |
| pg_write_server_files |
| pg_execute_server_program |
To see which roles have SELECT access on global_health_statistics:
SELECT grantee, privilege_type
FROM information_schema.role_table_grants
WHERE table_name = 'global_health_statistics';

| grantee | privilege_type |
|---|---|
| postgres | TRIGGER |
| postgres | REFERENCES |
| postgres | TRUNCATE |
| postgres | DELETE |
| postgres | UPDATE |
| postgres | SELECT |
| postgres | INSERT |
| learner_user | SELECT |
| learners_group | SELECT |
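The payoff of the group role shows up later: enrolling or removing a learner is a single GRANT or REVOKE, with no per-table work (learner_omar below is a hypothetical example user):

```sql
-- Enroll a learner created later; the group's table grants apply immediately
CREATE ROLE learner_omar WITH LOGIN PASSWORD 'Omar123';
GRANT learners_group TO learner_omar;

-- Withdraw access without touching any table-level grants
REVOKE learners_group FROM learner_omar;
```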
2. Backups and Restores
🔸 Using pgAdmin (GUI)
To Backup:
Right-click on SQL_Data_Manipulation
Click Backup
Choose format: Custom
Save the file (.backup)
To Restore:
Create a new empty database (e.g., SQL_Health_Restore)
Right-click → Restore
Load the .backup file
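The same backup and restore can be scripted from a command prompt with the CLI tools named in the summary below; this is a sketch assuming pg_dump, createdb, and pg_restore are on your PATH and the database name matches this document's examples:

```shell
# Custom-format backup (equivalent of pgAdmin's Backup dialog)
pg_dump -U postgres -F c -f SQL_Data_Manipulation.backup SQL_Data_Manipulation

# Restore into a fresh, empty database
createdb -U postgres SQL_Health_Restore
pg_restore -U postgres -d SQL_Health_Restore SQL_Data_Manipulation.backup
```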
3. Exporting Results and Reporting with Quarto
A. Export to CSV using psql
Note that server-side COPY ... TO writes the file on the database server and requires file-write privileges there; from psql, the client-side \copy variant takes the same query and writes to a path on your own machine instead.
COPY (
SELECT disease_name, year_event, AVG(mortality_pct) AS avg_mortality
FROM public.global_health_statistics
GROUP BY disease_name, year_event
) TO 'C:\\temp\\disease_mortality.csv' CSV HEADER;

B. Render a Quarto Table or Chart
SELECT
disease_name,
ROUND(AVG(mortality_pct)::numeric, 2) AS avg_mortality
FROM public.global_health_statistics
GROUP BY disease_name
ORDER BY avg_mortality DESC
LIMIT 10;

| disease_name | avg_mortality |
|---|---|
| Malaria | 5.20 |
| Ebola | 5.09 |
| Cholera | 5.08 |
| Tuberculosis | 5.06 |
| Parkinson’s Disease | 5.06 |
| COVID-19 | 5.05 |
| Hepatitis | 5.05 |
| Diabetes | 5.04 |
| Influenza | 5.03 |
| Measles | 5.02 |
library(readr)  # provides read_csv()
df <- read_csv("C:/temp/disease_mortality.csv")

Rows: 500 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): disease_name
dbl (2): year_event, avg_mortality
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(ggplot2)  # attach in case an earlier chunk has not loaded it
ggplot(df, aes(x = reorder(disease_name, avg_mortality), y = avg_mortality)) +
geom_col(fill = "#0077cc") +
coord_flip() +
labs(title = "Top 10 Diseases by Average Mortality",
x = "Disease", y = "Mortality %")

Database Administration & Reporting Tasks
| Task | Tool | Notes |
|---|---|---|
| Grant DB Access | SQL | GRANT CONNECT, GRANT SELECT |
| Bulk Learner Setup | SQL + Group Role | Use learners_group |
| Backup | pgAdmin / pg_dump | Choose Custom format |
| Restore | pgAdmin / pg_restore | Create empty DB first |
| Export | \copy in psql | Save results as CSV |
| Reporting | Quarto + ggplot | Use dynamic SQL chunks |
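The "dynamic SQL chunks" approach in the last row avoids the CSV round-trip entirely: since the setup chunk at the top of this document already creates `con` with DBI, an R chunk can pull the aggregate straight from PostgreSQL. A sketch, assuming `con` is still open and ggplot2 is loaded:

```r
# Pull the top-10 aggregate directly through the existing DBI connection
df <- DBI::dbGetQuery(con, "
  SELECT disease_name,
         ROUND(AVG(mortality_pct)::numeric, 2) AS avg_mortality
  FROM public.global_health_statistics
  GROUP BY disease_name
  ORDER BY avg_mortality DESC
  LIMIT 10;
")

ggplot(df, aes(x = reorder(disease_name, avg_mortality), y = avg_mortality)) +
  geom_col(fill = "#0077cc") +
  coord_flip() +
  labs(title = "Top 10 Diseases by Average Mortality",
       x = "Disease", y = "Mortality %")
```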
By combining PostgreSQL’s security model, backup features, and Quarto’s reporting power, you’re equipped to build a secure, scalable, and sharable data analysis environment.
📝 Summary and Conclusion
This presentation provided an end-to-end journey through PostgreSQL on Windows, covering everything from installation to advanced analytics and reporting.
We began with the foundations of SQL (ACID principles, PostgreSQL ecosystem, pgAdmin/psql), then demonstrated data loading and exploration. From there, we explored both core wrangling (SELECT, WHERE, JOIN, CASE) and advanced techniques (CTEs, window functions, deduplication, upserts).
The middle sections focused on descriptive statistics (aggregation, percentiles, crosstabs) and data quality (missing data handling, casting/cleaning, working with dates).
Next, we examined performance basics (indexing, query plans, tuning practices) to ensure queries run efficiently even on large health datasets.
Finally, we addressed the critical topics of security, backup, and sharing, demonstrating how to manage roles and privileges, perform reliable backups, and integrate results with Quarto for transparent reporting.
🎯 Key Takeaways
- SQL is more than queries — it is a full environment for storing, securing, analyzing, and sharing data.
- PostgreSQL + pgAdmin provide a robust platform for analytics work.
- Integration with R and Quarto allows reproducible, professional-grade reporting.
- Data governance practices (indexes, roles, backups) are as important as analysis techniques.
By combining technical SQL skills with performance, governance, and reproducibility, analysts can deliver results that are accurate, efficient, and trusted.
📚 References & Learning Resources
For further study and practice, here are recommended books and websites.
Books
- Beaulieu, A. (2020). Learning SQL: Generate, manipulate, and retrieve data (3rd ed.). O’Reilly Media. ISBN: 978-1492057611
- Obe, R. O., & Hsu, L. S. (2017). PostgreSQL: Up and running: A practical guide to the advanced open source database (3rd ed.). O’Reilly Media. ISBN: 978-1491963418
- DeBarros, A. (2018). Practical SQL: A beginner’s guide to storytelling with data. No Starch Press. ISBN: 978-1593278274
Websites & Documentation
- PostgreSQL Official Documentation
- Postgres Tutorial
- Mode Analytics SQL Tutorial
- RStudio Quarto Documentation