id birthdate deathdate ssn
1 898a6256-7ffb-dbe2-e24d-12fda2fedcfd 1966-06-23 999-98-5513
2 5e0a6984-38d7-c604-f55f-f1de5e933768 1986-07-04 999-58-5560
3 e335de09-0994-4111-3c15-6edcc17ae4bc 2005-04-25 999-48-3846
4 45b89342-dc05-8e57-8eee-9ed68ec42378 1984-08-24 999-40-3433
5 916b1ac8-56c8-ec1b-3b9a-721336a74912 1953-05-10 999-16-3735
6 dc323bce-e583-d903-303e-9c865bc87e67 2012-09-02 999-90-3605
drivers passport prefix first middle last suffix
1 S99971131 X62454905X Mrs. Maris768 Lindgren255 NA
2 S99931679 X61479569X Mrs. Hattie299 Reatha769 Nader710 NA
3 S99958413 Ms. Jimmie93 Towanda270 Doyle959 NA
4 S99985178 X27670495X Mr. Dwight645 Marlin805 Hilll811 NA
5 S99968840 X83646024X Ms. Tambra47 Brittni468 Balistreri607 NA
6 Carson894 Raphael767 Littel644 NA
maiden marital race ethnicity gender birthplace
1 Dach178 D white nonhispanic F Oxford Massachusetts US
2 Harber290 M white nonhispanic F North Reading Massachusetts US
3 white nonhispanic F Boston Massachusetts US
4 M white nonhispanic M Framingham Massachusetts US
5 S white nonhispanic F Lexington Massachusetts US
6 white nonhispanic M Plymouth Massachusetts US
address city state county
1 962 Stiedemann Vista Aquinnah Massachusetts Dukes County
2 720 O'Keefe Arcade Newton Massachusetts Middlesex County
3 493 Cruickshank Mission Holliston Massachusetts Middlesex County
4 810 Wolf Pathway Malden Massachusetts Middlesex County
5 146 Rowe Village Suite 3 New Bedford Massachusetts Bristol County
6 467 Collier Stravenue Unit 12 Attleboro Massachusetts Bristol County
fips zip lat lon healthcare_expenses healthcare_coverage income
1 NA 0 41.29618 -70.77458 343608.18 884243.46 178323
2 25017 2468 42.29180 -71.14930 169293.06 801407.90 86384
3 NA 0 42.24196 -71.46440 64178.99 5906.14 198522
4 25017 2148 42.43610 -71.14328 36010.63 472861.90 25909
5 25005 2743 41.69576 -71.01391 281198.90 310261.21 140772
6 44007 2861 41.94870 -71.30420 45972.75 0.00 71169
visit
1 1
2 1
3 1
4 1
5 1
6 1
Introduction
Missing data is a common challenge in survey, clinical, and population-based datasets.
It can arise from participant dropout, skipped responses, changes in study design, or inconsistencies in data collection.
If not addressed appropriately, missingness can introduce bias, reduce statistical power, and undermine the reliability of results.
I have also prepared a PowerPoint presentation on missing data management and analysis: visit the Missing data presentation.
This work showcases two complementary missing data management projects, developed across different datasets and research contexts:
- Stata-based missing data diagnostics and imputation
- Used structured diagnostic tools to quantify and visualise missingness, and identify patterns.
- Applied targeted imputation methods, including Multiple Imputation (MI), Last Observation Carried Forward (LOCF), Next Observation Carried Backward (NOCB), and hybrid approaches depending on variable type and study design.
- R-based reproducible workflows for missing data
- Designed flexible pipelines for detecting, cleaning, and imputing missing values.
- Demonstrated simple (mean/median) and advanced methods (MICE, model-based imputation), as well as forward/backward filling and participant-level imputation.
By combining statistical rigour, methodological adaptability, and transparent documentation, these approaches ensure that findings remain robust, reproducible, and comparable across studies, regardless of the data source.
Stata-Based Missing Data Management and Analysis Project
Some outputs are not presented here, as certain variables had small cell counts that could potentially compromise participant confidentiality.
Where the number of observations fell below accepted thresholds, they were withheld to remain in line with the minimum data reporting standards for sensitive datasets.
The overall and aggregated results, however, have been published in three peer-reviewed Q1 journal articles:
- Preterm Birth: https://doi.org/10.1016/j.midw.2022.103334
- Low Birthweight: https://doi.org/10.1093/eurpub/ckab033
- Caesarean Section and Labour Intervention: https://doi.org/10.1007/s43032-023-01219-7
In large-scale and longitudinal data, missing data is almost inevitable. It can arise from participant attrition, skipped responses, evolving survey designs, or inconsistencies in measurement over time.
If not addressed appropriately, missingness can reduce statistical power, bias estimates, and compromise the validity of conclusions.
1.1 Scientific Context
Longitudinal studies, such as the Australian Longitudinal Study on Women's Health (ALSWH), collect data from the same participants repeatedly over long periods. While this design allows for rich analysis of change over time, it is also prone to loss to follow-up and intermittent missingness.
International best practice suggests that when missingness exceeds 5–10% for a variable, imputation should be considered to improve precision and reduce bias (Rubin, 1987; Sterne et al., 2009). In this project:
- Maternal pre-pregnancy BMI and age at menarche had missingness above 10%.
- Missingness was often non-monotone, requiring tailored imputation strategies for different variable types.
1.2 Diagnostic Approaches
I began by systematically assessing the extent and structure of missingness:
- Used the Stata packages misschk, mdesc, and xtpatternvar to quantify missing values per variable and identify patterns.
- Classified patterns into monotone (e.g., dropout) and intermittent (e.g., missed responses but later re-entry).
- Inspected missingness by birth order and survey wave to account for pregnancy-specific variables.
- Flagged variables with >5% missingness for potential imputation.
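The diagnostic steps above can be sketched in R. This is a minimal illustration and not the Stata code used in the project: the `waves` data frame, its column names, and the `classify_pattern()` helper are hypothetical, and the 5% threshold simply mirrors the flagging rule described in the list.

```r
# Toy wide-format data (hypothetical): one row per participant, one column per wave
waves <- data.frame(
  id = 1:4,
  w1 = c(22.1, 25.0, 30.2, NA),
  w2 = c(22.4, NA,   NA,   27.5),
  w3 = c(23.0, NA,   29.8, 27.9)
)

# Percentage of missing values per variable, and variables above the 5% threshold
pct_missing <- colMeans(is.na(waves[, -1])) * 100
flagged <- names(pct_missing)[pct_missing > 5]

# Classify each participant's response pattern across waves
classify_pattern <- function(observed) {
  # observed: logical vector, TRUE where the wave was answered
  if (all(observed)) return("complete")
  if (!any(observed)) return("all missing")
  first_gap <- which(!observed)[1]
  # Monotone (dropout): nothing is observed after the first gap
  if (!any(observed[first_gap:length(observed)])) "monotone" else "intermittent"
}
pattern <- apply(!is.na(waves[, -1]), 1, classify_pattern)

print(round(pct_missing, 1))
print(table(pattern))
```

In this sketch a participant who skips the first wave and joins later is also labelled intermittent; a real analysis may want a separate category for late entry.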
1.3 Missing Data Mechanisms
Before selecting imputation methods, I classified variables according to the statistical mechanism of missingness:
- MCAR – Missing Completely at Random: The probability of a value being missing is unrelated to either the observed or unobserved data.
- MAR – Missing at Random: The probability of missingness depends only on observed data, not the missing values themselves.
- MNAR – Missing Not at Random: The probability of missingness depends on the unobserved data itself.
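To make the three mechanisms concrete, the following R simulation is one possible illustration. All variable names, the sample size, and the missingness probabilities are assumptions chosen for the example, not values from the ALSWH analysis. It deletes values of `y` under each mechanism and compares the mean of the remaining observations with the full-data mean.

```r
set.seed(42)
n <- 10000
x <- rnorm(n)          # fully observed covariate
y <- x + rnorm(n)      # outcome that will be made partly missing

# TRUE means the value of y is deleted
miss_mcar <- runif(n) < 0.3          # unrelated to x or y
miss_mar  <- runif(n) < plogis(x)    # depends only on the observed x
miss_mnar <- runif(n) < plogis(y)    # depends on the unobserved y itself

means <- c(
  full = mean(y),
  mcar = mean(y[!miss_mcar]),
  mar  = mean(y[!miss_mar]),
  mnar = mean(y[!miss_mnar])
)
print(round(means, 2))

# Under MAR the bias can be corrected using the observed covariate:
# fit y ~ x on the observed cases, then average predictions over everyone
fit <- lm(y ~ x, subset = !miss_mar)
mar_adjusted <- mean(predict(fit, newdata = data.frame(x = x)))
print(round(mar_adjusted, 2))
```

Under these assumptions the complete-case mean stays close to the truth for MCAR, is biased for MAR and MNAR, and only the MAR bias is repaired by conditioning on `x`. This is why the mechanism classification matters before choosing a method.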
1.4 Criteria for Imputation
A rule of thumb from the literature recommends initiating imputation when the missing proportion exceeds 5–10% of the variable’s observations. In this study:
- Maternal BMI and age at menarche exceeded 10% missingness.
- Time-varying health conditions (e.g., hypertension) had intermittent missingness between waves.
1.5 Imputation Methods Applied Using STATA
This project was completed using ALSWH data.
a) Mean Imputation (Row Mean)
- Applied to pre-pregnancy BMI (<15% missingness).
- Used the participant’s own available BMI values from other waves to calculate a personalised row mean.
- Observed mean: 24.46 kg/m²; Imputed mean: 24.63 kg/m² — indicating close alignment.
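A row-mean imputation of this kind might look as follows in R. This is a sketch under assumed names (`bmi`, `bmi_w1` to `bmi_w3`) with made-up values; the project itself was carried out in Stata.

```r
# Toy wide-format BMI data (hypothetical values), one column per wave
bmi <- data.frame(
  id     = 1:3,
  bmi_w1 = c(24.0, NA,   27.0),
  bmi_w2 = c(25.0, 22.0, NA),
  bmi_w3 = c(NA,   23.0, NA)
)
wave_cols <- c("bmi_w1", "bmi_w2", "bmi_w3")

# Personalised row mean from each participant's own observed waves
row_mean <- rowMeans(bmi[, wave_cols], na.rm = TRUE)

# Fill each participant's gaps with their own row mean
bmi_imputed <- bmi
for (col in wave_cols) {
  gaps <- is.na(bmi_imputed[[col]])
  bmi_imputed[[col]][gaps] <- row_mean[gaps]
}

# Compare observed and imputed means, as in the alignment check above
observed_mean <- mean(unlist(bmi[, wave_cols]), na.rm = TRUE)
imputed_mean  <- mean(unlist(bmi_imputed[, wave_cols]))
print(c(observed = observed_mean, imputed = imputed_mean))
```

A participant with no observed BMI at any wave has a row mean of `NaN`, so such rows would need a different method.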
b) Multiple Imputation (MI)
- Applied to age at menarche (>10% missingness), a static variable recorded only once (Survey 2).
- Used Multivariate Imputation by Chained Equations (MICE) with 20 imputations, following guidelines to reduce sampling error.
- Incorporated auxiliary variables predictive of missingness.
- Conducted diagnostics: density plots, distribution overlays, and observed vs. imputed comparison.
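After the imputed datasets are analysed, the per-dataset results are usually combined with Rubin's rules: the pooled estimate is the average of the estimates, and the total variance adds the between-imputation variance, inflated by 1 + 1/m, to the average within-imputation variance. The helper below, `pool_rubin()`, is a hypothetical name, and the three estimates are invented for illustration (the project used 20 imputations in Stata).

```r
# Pool estimates from m imputed datasets using Rubin's rules
pool_rubin <- function(estimates, variances) {
  m       <- length(estimates)
  q_bar   <- mean(estimates)   # pooled point estimate
  within  <- mean(variances)   # average within-imputation variance
  between <- var(estimates)    # between-imputation variance
  total   <- within + (1 + 1 / m) * between
  list(estimate = q_bar, se = sqrt(total), within = within, between = between)
}

# Invented per-dataset coefficients and their squared standard errors
res <- pool_rubin(estimates = c(1.0, 1.2, 0.8),
                  variances = c(0.04, 0.05, 0.03))
print(res)
```

The between-imputation term is what distinguishes multiple from single imputation: it carries the uncertainty about the missing values into the final standard error.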
c) Last Observation Carried Forward (LOCF)
- Used for chronic conditions (e.g., hypertension, diabetes) that persist after diagnosis.
- Ensured stability in prevalence estimates over time.
d) Next Observation Carried Backward (NOCB)
- Applied to lifetime exposure variables (e.g., partner violence). If reported in later waves, status was carried backward.
e) Combined LOCF + NOCB
- Applied to variables requiring lifetime ascertainment, maximising completeness in both time directions; for example, country of birth.
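The carry-forward and carry-backward rules in (c) to (e) can be sketched in base R as below. The helpers `locf()` and `nocb()`, the `long` data frame, and its values are hypothetical; this is an illustration of the technique rather than the Stata implementation.

```r
# Last observation carried forward within one participant's ordered values
locf <- function(x) {
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))  # position of last observed value
  idx[idx == 0] <- NA                                # nothing observed yet
  x[idx]
}
# Next observation carried backward: LOCF applied to the reversed series
nocb <- function(x) rev(locf(rev(x)))

# Toy long-format data (hypothetical): two participants, four waves each
long <- data.frame(
  id            = c(1, 1, 1, 1, 2, 2, 2, 2),
  wave          = c(1, 2, 3, 4, 1, 2, 3, 4),
  hypertension  = c(0, NA, 1, NA, NA, 0, NA, NA),
  ever_violence = c(NA, NA, 1, NA, NA, 0, NA, 0)
)
long <- long[order(long$id, long$wave), ]

# Apply each rule within participant
long$hypertension_locf  <- ave(long$hypertension,  long$id, FUN = locf)
long$ever_violence_nocb <- ave(long$ever_violence, long$id, FUN = nocb)
long$ever_violence_both <- ave(long$ever_violence, long$id,
                               FUN = function(x) nocb(locf(x)))
print(long)
```

LOCF leaves leading gaps untouched and NOCB leaves trailing gaps untouched, which is why the combined version is needed when a value should be complete in both time directions.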
1.6 Sensitivity Analyses
To ensure the robustness of results:
- Conducted complete-case analysis alongside imputed datasets.
- Compared distributions of observed vs. imputed data to detect anomalies.
- Re-estimated models with alternative imputation strategies to check stability of effect estimates.
Table: Summary of Missing Data Handling Techniques
| Variable Type | Example Variable | Method Applied | Rationale |
|---|---|---|---|
| Static, continuous | Age at menarche | Multiple Imputation (20 datasets, chained equations) | Single-time measurement; high missingness; categorical imputation adjusted for predictors |
| Time-varying, continuous | BMI prior to birth | Row mean imputation using available values from same participant | Preserves within-person trajectory; missing at intermittent waves |
| Chronic binary condition | Hypertension, diabetes | Last Observation Carried Forward (LOCF) | Lifetime conditions assumed persistent once reported |
| Lifetime event binary | Ever experienced partner violence | LOCF + Next Observation Carried Backward (NOCB) | Ensures capture of lifetime exposure regardless of reporting wave |
In summary, the combination of diagnostic mapping, mechanism classification, tailored imputation, and sensitivity analyses allowed me to address missingness in a scientifically sound manner, maximising the analytic value of this 20-year longitudinal dataset.
Final Reflection:
I have had the privilege of contributing to one of Australia’s most significant longitudinal studies, producing results that are both methodologically robust and policy-relevant.
The lessons I have gained about rigour, adaptability, and the potential of data to drive meaningful change will remain central to my work and collaborations. I will apply these skills in future projects that connect advanced analytics with real-world health outcomes, both in Australia and internationally.
If you are interested in exploring each paper in greater depth, including the methods applied, please follow the links below. I hope you find them insightful.
- Preterm Birth: https://doi.org/10.1016/j.midw.2022.103334
- Low Birthweight: https://doi.org/10.1093/eurpub/ckab033
- Caesarean Section and Labour Intervention: https://doi.org/10.1007/s43032-023-01219-7
R-Based Missing Data Management and Analysis Project
Missing Value Identification and Imputations
A. Identifying Missing Values
is.na(), anyNA(), sum(is.na(x)), colSums() – detect/count missing values
complete.cases() – filter complete rows
Convert blanks to NA: x[x == ""] <- NA
B. Handling Missing Values
Basic Removal/Replacement: na.omit(), replace_na() (tidyr), ifelse()
Forward/Backward Fill: fill() (tidyr) with .direction = "down" / "up"
Statistical Imputation:
- Simple: mean/median imputation
- Multiple: mice::mice()
- Advanced: Hmisc::aregImpute(), missForest, Amelia, zoo::na.fill()
Missing data is a common issue in real-world datasets. In R, handling missing values properly is essential to ensure the integrity of analyses and avoid misleading results.
This section covers how to identify, clean, and impute missing values using base R and tidyverse tools. Techniques range from basic filtering to advanced imputation models.
➡️ Missing values are typically represented as NA in R. Empty strings (e.g., "") or special codes (e.g., -99, "missing") may also indicate missingness and need conversion before analysis.
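As a small illustration of that conversion step, the sketch below recodes blanks and sentinel codes to NA across a data frame. The `survey` data and the list of codes in `missing_codes` are assumptions for the example; in practice the codes come from the dataset's codebook.

```r
# Toy data (hypothetical) with blanks and special missing codes
survey <- data.frame(
  age    = c(34, -99, 51),
  smoker = c("yes", "", "missing"),
  stringsAsFactors = FALSE
)

# Codes that should be treated as missing (assumed for this example)
missing_codes <- c("", "missing", "-99")

# Recode every column, keeping each column's original type
survey[] <- lapply(survey, function(col) {
  col[as.character(col) %in% missing_codes] <- NA
  col
})

print(survey)
print(colSums(is.na(survey)))
```

Applying one code list to every column is convenient but blunt: a value such as -99 may be legitimate in some variables, so a per-column list is safer for real data.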
🔹 Summary of Functions for Handling Missing Values
| Task | Function / Code | Purpose |
|---|---|---|
| Detect missing values | is.na(x), anyNA(x), sum(is.na(x)) | Identify/count missing values |
| Missing per column | colSums(is.na(df)) | Count NAs per column |
| Filter complete rows | complete.cases(df) | Keep only rows without any missing values |
| Convert blanks to NA | x[x == ""] <- NA | Convert empty strings to missing values |
| Remove missing rows | na.omit(df) | Remove rows with any missing value |
| Replace missing values | replace_na(df, list(...)), ifelse(is.na(x), ...) | Impute or replace NA with custom logic |
| Forward/backward fill | fill(df, .direction = "down" / "up") | Fill NA using nearby values |
| Mean/median imputation | x[is.na(x)] <- mean(x, na.rm = TRUE) | Replace NA with summary statistic |
| Multiple imputation | mice::mice(df) | Create multiple imputed datasets using chained equations |
| Advanced imputation | Hmisc::aregImpute, missForest, Amelia, zoo::na.fill() | Specialized techniques for complex imputations |
📚 Further Learning and Resources
Additional tutorials, demonstrations, and datasets are available on:
YouTube Channel – AnalyticsHub
https://studio.youtube.com/channel/UCLVKP0g8GvHhOh0kn20T8eg/videos/upload?filter=%5B%5D&sort=%7B%22columnType%22%3A%22date%22%2C%22sortOrder%22%3A%22DESCENDING%22%7D
GitHub Repository
https://github.com/HabtamuBizuayehu?tab=repositories
# Summarise missing values across all columns
missing_summary <- colSums(is.na(visits_all))
print(missing_summary[missing_summary > 0])
deathdate drivers passport prefix
40 4 12 8
middle suffix maiden marital
4 40 28 16
race fips zip healthcare_coverage
20 12 12 4
income
20
# Total number of missing cells
total_missing <- sum(is.na(visits_all))
cat("Total missing values in dataset:", total_missing, "\n")
Total missing values in dataset: 220
# Count of individuals with at least one missing value
n_missing_individuals <- sum(!complete.cases(visits_all))
cat("Number of individuals with at least one missing value:", n_missing_individuals, "\n")
Number of individuals with at least one missing value: 40
# Sort records by patient ID and visit number
visits_all <- visits_all %>%
  arrange(id_text, visit)
# View rows with any missing values
missing_rows <- visits_all %>%
  filter(!complete.cases(.))
head(missing_rows, n = 12) %>% select(id_text, race, income, passport)
id_text race income passport
1 Anit|Sánc|1960-05-07 white 61016 X9157439X
2 Anit|Sánc|1960-05-07 <NA> NA X9157439X
3 Anit|Sánc|1960-05-07 white NA X9157439X
4 Anit|Sánc|1960-05-07 <NA> 60016 X9157439X
5 Aure|Weis|2003-11-12 white 8752 X17586296X
6 Aure|Weis|2003-11-12 <NA> NA X17586296X
7 Aure|Weis|2003-11-12 white NA X17586296X
8 Aure|Weis|2003-11-12 <NA> 7752 X17586296X
9 Cars|Litt|2012-09-02 white 71169 <NA>
10 Cars|Litt|2012-09-02 <NA> NA <NA>
11 Cars|Litt|2012-09-02 <NA> 70169 <NA>
12 Cars|Litt|2012-09-02 white NA <NA>
# Impute passport with "UNKNOWN"
df_passport <- visits_all %>%
  mutate(passport_imputed = replace_na(passport, "UNKNOWN")) %>%
  filter(is.na(passport)) %>%
  select(id, id_text, visit, passport, passport_imputed)
head(df_passport, n = 12)
id id_text visit passport
1 dc323bce-e583-d903-303e-9c865bc87e67 Cars|Litt|2012-09-02 1 <NA>
2 dc323bce-e583-d903-303e-9c865bc87e67 Cars|Litt|2012-09-02 2 <NA>
3 dc323bce-e583-d903-303e-9c865bc87e67 Cars|Litt|2012-09-02 3 <NA>
4 dc323bce-e583-d903-303e-9c865bc87e67 Cars|Litt|2012-09-02 4 <NA>
5 da7b1f55-c782-544f-ba8c-fe69d519dc85 Doll|Fran|2008-10-04 1 <NA>
6 da7b1f55-c782-544f-ba8c-fe69d519dc85 Doll|Fran|2008-10-04 2 <NA>
7 da7b1f55-c782-544f-ba8c-fe69d519dc85 Doll|Fran|2008-10-04 3 <NA>
8 da7b1f55-c782-544f-ba8c-fe69d519dc85 Doll|Fran|2008-10-04 4 <NA>
9 e335de09-0994-4111-3c15-6edcc17ae4bc Jimm|Doyl|2005-04-25 1 <NA>
10 e335de09-0994-4111-3c15-6edcc17ae4bc Jimm|Doyl|2005-04-25 2 <NA>
11 e335de09-0994-4111-3c15-6edcc17ae4bc Jimm|Doyl|2005-04-25 3 <NA>
12 e335de09-0994-4111-3c15-6edcc17ae4bc Jimm|Doyl|2005-04-25 4 <NA>
passport_imputed
1 UNKNOWN
2 UNKNOWN
3 UNKNOWN
4 UNKNOWN
5 UNKNOWN
6 UNKNOWN
7 UNKNOWN
8 UNKNOWN
9 UNKNOWN
10 UNKNOWN
11 UNKNOWN
12 UNKNOWN
# Sort records by patient ID and visit number (to prepare for forward/backward fill)
visits_all <- visits_all %>%
  arrange(id_text, visit)
# Preview sorted data (tail)
tail(visits_all %>% select(id_text, visit, race, income), n = 12)
id_text visit race income
29 Jimm|Doyl|2005-04-25 1 white 198522
30 Jimm|Doyl|2005-04-25 2 <NA> NA
31 Jimm|Doyl|2005-04-25 3 <NA> 197522
32 Jimm|Doyl|2005-04-25 4 white NA
33 Mari|Lind|1966-06-23 1 white 178323
34 Mari|Lind|1966-06-23 2 <NA> NA
35 Mari|Lind|1966-06-23 3 white NA
36 Mari|Lind|1966-06-23 4 <NA> 177323
37 Tamb|Bali|1953-05-10 1 white 140772
38 Tamb|Bali|1953-05-10 2 <NA> NA
39 Tamb|Bali|1953-05-10 3 white NA
40 Tamb|Bali|1953-05-10 4 <NA> 139772
# Forward fill missing race values within each patient (downward)
visits_all <- visits_all %>%
  group_by(id_text) %>%
  mutate(race_forward = race) %>%
  fill(race_forward, .direction = "down") %>%
  ungroup()
# Preview forward-filled race values (head)
head(visits_all %>% select(id_text, visit, race, race_forward), n = 12)
# A tibble: 12 × 4
id_text visit race race_forward
<chr> <dbl> <chr> <chr>
1 Anit|Sánc|1960-05-07 1 white white
2 Anit|Sánc|1960-05-07 2 <NA> white
3 Anit|Sánc|1960-05-07 3 white white
4 Anit|Sánc|1960-05-07 4 <NA> white
5 Aure|Weis|2003-11-12 1 white white
6 Aure|Weis|2003-11-12 2 <NA> white
7 Aure|Weis|2003-11-12 3 white white
8 Aure|Weis|2003-11-12 4 <NA> white
9 Cars|Litt|2012-09-02 1 white white
10 Cars|Litt|2012-09-02 2 <NA> white
11 Cars|Litt|2012-09-02 3 <NA> white
12 Cars|Litt|2012-09-02 4 white white
# Backward fill missing race values within each patient (upward)
visits_all <- visits_all %>%
  group_by(id_text) %>%
  mutate(race_backward = race) %>%
  fill(race_backward, .direction = "up") %>%
  ungroup()
# Preview backward-filled race values (tail)
tail(visits_all %>% select(id_text, visit, race, race_backward), n = 12)
# A tibble: 12 × 4
id_text visit race race_backward
<chr> <dbl> <chr> <chr>
1 Jimm|Doyl|2005-04-25 1 white white
2 Jimm|Doyl|2005-04-25 2 <NA> white
3 Jimm|Doyl|2005-04-25 3 <NA> white
4 Jimm|Doyl|2005-04-25 4 white white
5 Mari|Lind|1966-06-23 1 white white
6 Mari|Lind|1966-06-23 2 <NA> white
7 Mari|Lind|1966-06-23 3 white white
8 Mari|Lind|1966-06-23 4 <NA> <NA>
9 Tamb|Bali|1953-05-10 1 white white
10 Tamb|Bali|1953-05-10 2 <NA> white
11 Tamb|Bali|1953-05-10 3 white white
12 Tamb|Bali|1953-05-10 4 <NA> <NA>
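The backward fill above leaves a participant's final visit missing when no later value exists, just as a forward fill leaves leading gaps. For a time-invariant variable, both directions can be combined in one step with `.direction = "downup"` in `tidyr::fill()`. The sketch below uses a small hypothetical `toy` tibble with invented values rather than `visits_all`.

```r
library(dplyr)
library(tidyr)

# Toy visit-level data (hypothetical values)
toy <- tibble(
  id_text = c("A", "A", "A", "A", "B", "B", "B"),
  visit   = c(1, 2, 3, 4, 1, 2, 3),
  race    = c("white", NA, "white", NA, NA, "asian", NA)
)

toy_filled <- toy %>%
  arrange(id_text, visit) %>%
  group_by(id_text) %>%
  mutate(race_both = race) %>%
  # Fill downward first, then upward, within each participant
  fill(race_both, .direction = "downup") %>%
  ungroup()

print(toy_filled)
```

This is only appropriate when the variable cannot change between visits; for time-varying measures, filling in both directions can overwrite genuine change.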
# Calculate the global (overall) mean income
global_mean_income <- mean(visits_all$income, na.rm = TRUE)
# Use global mean to impute missing income values
visits_all <- visits_all %>%
  mutate(income_global_mean = ifelse(is.na(income), global_mean_income, income))
# Preview global mean-imputed income (head)
head(visits_all %>% select(id_text, visit, income, income_global_mean), n = 12)
# A tibble: 12 × 4
id_text visit income income_global_mean
<chr> <dbl> <dbl> <dbl>
1 Anit|Sánc|1960-05-07 1 61016 61016
2 Anit|Sánc|1960-05-07 2 NA 96123.
3 Anit|Sánc|1960-05-07 3 NA 96123.
4 Anit|Sánc|1960-05-07 4 60016 60016
5 Aure|Weis|2003-11-12 1 8752 8752
6 Aure|Weis|2003-11-12 2 NA 96123.
7 Aure|Weis|2003-11-12 3 NA 96123.
8 Aure|Weis|2003-11-12 4 7752 7752
9 Cars|Litt|2012-09-02 1 71169 71169
10 Cars|Litt|2012-09-02 2 NA 96123.
11 Cars|Litt|2012-09-02 3 70169 70169
12 Cars|Litt|2012-09-02 4 NA 96123.
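One known side effect of global mean imputation is that it shrinks the variable's spread, because every imputed value sits exactly at the centre of the distribution. The short sketch below illustrates this with a made-up `income` vector; the numbers are not taken from `visits_all`.

```r
# Made-up income values with two gaps
income <- c(10000, 20000, NA, 40000, NA)

observed_mean <- mean(income, na.rm = TRUE)
observed_sd   <- sd(income, na.rm = TRUE)

# Global mean imputation
income_mean_imputed <- ifelse(is.na(income), observed_mean, income)

imputed_mean <- mean(income_mean_imputed)
imputed_sd   <- sd(income_mean_imputed)

# The mean is preserved, but the standard deviation is smaller after imputation
print(c(observed_mean = observed_mean, imputed_mean = imputed_mean))
print(c(observed_sd = observed_sd, imputed_sd = imputed_sd))
```

This understated variability is one reason to prefer participant-level means or model-based imputation when standard errors matter.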
# Calculate mean income per participant (id_text)
participant_mean_income <- visits_all %>%
  group_by(id_text) %>%
  summarise(mean_income_id = mean(income, na.rm = TRUE), .groups = "drop")
# Use participant-level mean to impute missing income values
visits_all <- visits_all %>%
  left_join(participant_mean_income, by = "id_text") %>%
  mutate(income_participant_mean = ifelse(is.na(income), mean_income_id, income))
# Preview participant mean-imputed income (tail)
tail(visits_all %>% select(id_text, visit, income, income_participant_mean), n = 12)
# A tibble: 12 × 4
id_text visit income income_participant_mean
<chr> <dbl> <dbl> <dbl>
1 Jimm|Doyl|2005-04-25 1 198522 198522
2 Jimm|Doyl|2005-04-25 2 NA 198022
3 Jimm|Doyl|2005-04-25 3 197522 197522
4 Jimm|Doyl|2005-04-25 4 NA 198022
5 Mari|Lind|1966-06-23 1 178323 178323
6 Mari|Lind|1966-06-23 2 NA 177823
7 Mari|Lind|1966-06-23 3 NA 177823
8 Mari|Lind|1966-06-23 4 177323 177323
9 Tamb|Bali|1953-05-10 1 140772 140772
10 Tamb|Bali|1953-05-10 2 NA 140272
11 Tamb|Bali|1953-05-10 3 NA 140272
12 Tamb|Bali|1953-05-10 4 139772 139772
# Chained-equations imputation (mice, predictive mean matching) for fips, zip, healthcare_coverage, and income
# Note: m = 1 produces a single completed dataset; multiple imputation proper uses m > 1 and pools the results
patients_mice <- visits_all %>%
  select(id_text, fips, zip, healthcare_coverage, income)
# Apply mice
imputed <- mice(patients_mice %>% select(-id_text), method = "pmm", m = 1, printFlag = FALSE)
# Retrieve completed dataset
mice_result <- complete(imputed)
colnames(mice_result) <- paste0(colnames(mice_result), "_imputed")
# Join ID and filter rows with original missing values
df_mice <- bind_cols(patients_mice, mice_result) %>%
  filter(is.na(fips) | is.na(zip) | is.na(healthcare_coverage) | is.na(income)) %>%
  select(id_text,
         fips, fips_imputed,
         zip, zip_imputed,
         healthcare_coverage, healthcare_coverage_imputed,
         income, income_imputed)
head(df_mice, n = 12)
# A tibble: 12 × 9
id_text fips fips_imputed zip zip_imputed healthcare_coverage
<chr> <int> <int> <int> <int> <dbl>
1 Anit|Sánc|1960-05-07 25009 25009 1915 1915 1280070.
2 Anit|Sánc|1960-05-07 25009 25009 1915 1915 1280070.
3 Aure|Weis|2003-11-12 25027 25027 1606 1606 253645.
4 Aure|Weis|2003-11-12 25027 25027 1606 1606 253645.
5 Cars|Litt|2012-09-02 44007 44007 2861 2861 NA
6 Cars|Litt|2012-09-02 44007 44007 2861 2861 NA
7 Cars|Litt|2012-09-02 44007 44007 2861 2861 NA
8 Cars|Litt|2012-09-02 44007 44007 2861 2861 NA
9 Doll|Fran|2008-10-04 NA 25013 NA 2468 422516.
10 Doll|Fran|2008-10-04 NA 25027 NA 1030 422516.
11 Doll|Fran|2008-10-04 NA 25017 NA 2861 422516.
12 Doll|Fran|2008-10-04 NA 25017 NA 1606 422516.
# ℹ 3 more variables: healthcare_coverage_imputed <dbl>, income <dbl>,
# income_imputed <dbl>
# Optional: Inspect logged events from mice
imputed$loggedEvents
it im dep meth
1 3 1 healthcare_coverage pmm
out
1 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
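The call above uses `m = 1`, which yields one completed dataset. A fuller multiple-imputation workflow creates several datasets, fits the analysis model in each, and pools the results. The sketch below shows that pattern with `nhanes`, a small example dataset shipped with the mice package, rather than `visits_all`; the model `bmi ~ age + chl` and the seed are arbitrary choices for illustration.

```r
library(mice)

# Create 20 imputed datasets with predictive mean matching
imp <- mice(nhanes, m = 20, method = "pmm", seed = 2024, printFlag = FALSE)

# Fit the same model in every imputed dataset
fit <- with(imp, lm(bmi ~ age + chl))

# Combine the 20 sets of estimates using Rubin's rules
pooled <- pool(fit)
print(summary(pooled))
```

Pooling gives one set of coefficients whose standard errors reflect both sampling error and the uncertainty about the imputed values, which a single completed dataset cannot provide.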
# Final summary: compare original and imputed values for rows with missingness
final_impute_check <- visits_all %>%
  filter(is.na(race) | is.na(income)) %>%
  select(id_text, visit,
         race, race_forward, race_backward,
         income, income_global_mean, income_participant_mean)
# Preview top 12
head(final_impute_check, n = 12)
# A tibble: 12 × 8
id_text visit race race_forward race_backward income income_global_mean
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 Anit|Sánc|1… 2 <NA> white white NA 96123.
2 Anit|Sánc|1… 3 white white white NA 96123.
3 Anit|Sánc|1… 4 <NA> white <NA> 60016 60016
4 Aure|Weis|2… 2 <NA> white white NA 96123.
5 Aure|Weis|2… 3 white white white NA 96123.
6 Aure|Weis|2… 4 <NA> white <NA> 7752 7752
7 Cars|Litt|2… 2 <NA> white white NA 96123.
8 Cars|Litt|2… 3 <NA> white white 70169 70169
9 Cars|Litt|2… 4 white white white NA 96123.
10 Doll|Fran|2… 2 <NA> white white NA 96123.
11 Doll|Fran|2… 3 <NA> white white 187023 187023
12 Doll|Fran|2… 4 white white white NA 96123.
# ℹ 1 more variable: income_participant_mean <dbl>
# Preview bottom 12
tail(final_impute_check, n = 12)
# A tibble: 12 × 8
id_text visit race race_forward race_backward income income_global_mean
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 Home|Quiñ|1… 2 <NA> white white NA 96123.
2 Home|Quiñ|1… 3 <NA> white white 6361 6361
3 Home|Quiñ|1… 4 white white white NA 96123.
4 Jimm|Doyl|2… 2 <NA> white white NA 96123.
5 Jimm|Doyl|2… 3 <NA> white white 197522 197522
6 Jimm|Doyl|2… 4 white white white NA 96123.
7 Mari|Lind|1… 2 <NA> white white NA 96123.
8 Mari|Lind|1… 3 white white white NA 96123.
9 Mari|Lind|1… 4 <NA> white <NA> 177323 177323
10 Tamb|Bali|1… 2 <NA> white white NA 96123.
11 Tamb|Bali|1… 3 white white white NA 96123.
12 Tamb|Bali|1… 4 <NA> white <NA> 139772 139772
# ℹ 1 more variable: income_participant_mean <dbl>
# Clean up temporary objects
rm(participant_mean_income, final_impute_check, global_mean_income, df_passport, df_mice,
   patients_mice, imputed, mice_result, missing_summary, total_missing, n_missing_individuals, missing_rows)
Conclusion
Managing missing data is a core part of ensuring the validity, reproducibility, and interpretability of findings. In multi-decade datasets, missingness arises from participant attrition, evolving survey designs, structural inconsistencies, and changes in data collection priorities, and each source requires a tailored approach. Without a structured approach, these gaps can bias results, reduce statistical power, and undermine the credibility of findings.
An effective missing data strategy should:
- Begin with comprehensive diagnostics to assess the extent, patterns, and potential mechanisms of missingness, whether MCAR (Missing Completely at Random), MAR (Missing at Random), or MNAR (Missing Not at Random), as this classification guides the choice of analytical remedies.
- Use appropriate imputation techniques, from simple methods such as mean substitution and last observation carried forward (LOCF) to advanced approaches such as multiple imputation or combined strategies, selected according to the type of variable and nature of missingness.
- Incorporate sensitivity analyses to evaluate the robustness of results under different assumptions.
- Maintain transparent documentation of all decisions, code, and outputs, ensuring reproducibility and facilitating peer review.
Ultimately, the goal is not to replace missing values simply to achieve a complete dataset, but to apply appropriate methods that preserve the validity, reliability, and interpretability of the analysis. A structured, documented, and flexible approach allows data analysts to maximise the value of data while minimising bias and loss of precision—ensuring that findings can confidently and precisely inform both science and policy.
References
A. Book and Articles
- Dong Y, Peng C-YJ. Principled missing data methods for researchers. SpringerPlus. 2013;2(1):222.
- Bennett DA. How can I deal with missing data in my study? Australian and New Zealand Journal of Public Health. 2001;25(5):464-9.
- StataCorp. Stata Statistical Software: Release 13. College Station, TX: StataCorp LP; 2013.
- White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med. 2011;30(4):377-99. https://www.sagepub.com/sites/default/files/upm-binaries/45664_6.pdf
B. Useful websites:
https://missingdata.org/
https://www.missingdata.nl/missing-data/missing-data-methods/