Missing Data Management and Analysis

Introduction

Missing data is a common challenge in survey, clinical, and population-based datasets.
It can arise from participant dropout, skipped responses, changes in study design, or inconsistencies in data collection.
If not addressed appropriately, missingness can introduce bias, reduce statistical power, and undermine the reliability of results.

A PowerPoint version of this material is also available: see the Missing data presentation.

This work showcases two complementary missing data management projects, developed across different datasets and research contexts:

Stata-based missing data diagnostics and imputation
- Used structured diagnostic tools to quantify and visualise missingness, and identify patterns.
- Applied targeted imputation methods, including Multiple Imputation (MI), Last Observation Carried Forward (LOCF), Next Observation Carried Backward (NOCB), and hybrid approaches depending on variable type and study design.

R-based reproducible workflows for missing data
- Designed flexible pipelines for detecting, cleaning, and imputing missing values.
- Demonstrated simple (mean/median) and advanced methods (MICE, model-based imputation), as well as forward/backward filling and participant-level imputation.

By combining statistical rigour, methodological adaptability, and transparent documentation, these approaches ensure that findings remain robust, reproducible, and comparable across studies, regardless of the data source.

Stata-Based Missing Data Management and Analysis Project

Some outputs are not presented here, as certain variables had small cell counts that could potentially compromise participant confidentiality.
Where the number of observations fell below accepted thresholds, they were withheld to remain in line with the minimum data reporting standards for sensitive datasets.

The overall and aggregated results, however, have been published in three peer-reviewed Q1 journal articles:
- Preterm Birth: https://doi.org/10.1016/j.midw.2022.103334
- Low Birthweight: https://doi.org/10.1093/eurpub/ckab033
- Caesarean Section and Labour Intervention: https://doi.org/10.1007/s43032-023-01219-7

In large-scale and longitudinal data, missing data is almost inevitable. It can arise from participant attrition, skipped responses, evolving survey designs, or inconsistencies in measurement over time.
If not addressed appropriately, missingness can reduce statistical power, bias estimates, and compromise the validity of conclusions.

1.1 Scientific Context

Longitudinal studies, such as the Australian Longitudinal Study on Women's Health (ALSWH), collect data from the same participants repeatedly over long periods. While this design allows for rich analysis of change over time, it is also prone to loss to follow-up and intermittent missingness.

International best practice suggests that when missingness exceeds 5–10% for a variable, imputation should be considered to improve precision and reduce bias (Rubin, 1987; Sterne et al., 2009). In this project:
- Maternal pre-pregnancy BMI and age at menarche had missingness above 10%.
- Missingness was often non-monotone, requiring tailored imputation strategies for different variable types.

1.2 Diagnostic Approaches

I began by systematically assessing the extent and structure of missingness:

- Used the Stata packages misschk, mdesc, and xtpatternvar to quantify missing values per variable and identify patterns.
- Classified patterns into monotone (e.g., dropout) and intermittent (e.g., missed responses followed by later re-entry).
- Inspected missingness by birth order and survey wave to account for pregnancy-specific variables.
- Flagged variables with >5% missingness for potential imputation.
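
The diagnostic steps above can be sketched in a few lines of base R. This is an illustration only, not the Stata commands used in the project: the participants (p1 to p3), the four waves, and the observed/missing indicators are hypothetical toy data, and classify_pattern() is a helper name introduced here.

```r
# Toy indicator matrix: rows = participants, columns = waves,
# TRUE = value observed, FALSE = value missing (hypothetical data)
observed <- rbind(
  p1 = c(TRUE, TRUE, TRUE, TRUE),    # complete
  p2 = c(TRUE, TRUE, FALSE, FALSE),  # dropout
  p3 = c(TRUE, FALSE, TRUE, TRUE)    # missed one wave, then re-entered
)
colnames(observed) <- paste0("wave", 1:4)

# Classify one participant's pattern:
# monotone = once a value is missing, nothing is observed afterwards
classify_pattern <- function(obs) {
  if (all(obs)) return("complete")
  first_missing <- which(!obs)[1]
  if (!any(obs[first_missing:length(obs)])) "monotone" else "intermittent"
}

patterns <- apply(observed, 1, classify_pattern)

# Percentage missing per wave, and waves above the 5% flagging threshold
pct_missing <- 100 * colMeans(!observed)
flagged <- names(pct_missing)[pct_missing > 5]

print(patterns)
print(round(pct_missing, 1))
print(flagged)
```

The same indicator matrix can be built from real data with !is.na() applied to the wave-specific columns of a variable.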

1.3 Missing Data Mechanisms

Before selecting imputation methods, I classified variables according to the statistical mechanism of missingness:

- MCAR (Missing Completely at Random): The probability of a value being missing is unrelated to either the observed or unobserved data.
- MAR (Missing at Random): The probability of missingness depends only on observed data, not the missing values themselves.
- MNAR (Missing Not at Random): The probability of missingness depends on the unobserved data itself.
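
A small simulation can make the three mechanisms concrete. The sketch below is an assumption-laden illustration in base R, not an analysis of ALSWH data: age and bmi are simulated, and the missingness probabilities are arbitrary choices made only to show how the observed mean behaves under each mechanism.

```r
set.seed(42)
n <- 10000

# Simulated (hypothetical) data: BMI is weakly related to age
age <- rnorm(n, mean = 50, sd = 10)
bmi <- 25 + 0.1 * (age - 50) + rnorm(n, mean = 0, sd = 3)

# MCAR: every value has the same 20% chance of being missing
miss_mcar <- runif(n) < 0.20

# MAR: the chance of missing BMI depends only on the observed age
miss_mar <- runif(n) < plogis(-1.5 + 0.08 * (age - 50))

# MNAR: the chance of missing BMI depends on the BMI value itself
miss_mnar <- runif(n) < plogis(-1.5 + 0.4 * (bmi - 25))

# Mean BMI among the values that remain observed under each mechanism
observed_means <- c(
  full = mean(bmi),
  mcar = mean(bmi[!miss_mcar]),
  mar  = mean(bmi[!miss_mar]),
  mnar = mean(bmi[!miss_mnar])
)
print(round(observed_means, 2))
```

Under MCAR the complete-case mean stays close to the full-data mean. Under MNAR it is pulled downwards, because higher BMI values are more likely to be missing in this setup. Under MAR the bias can in principle be corrected by conditioning on age, which is the situation multiple imputation is designed for.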

1.4 Criteria for Imputation

A rule of thumb from the literature recommends initiating imputation when the missing proportion exceeds 5–10% of the variable’s observations. In this study:
- Maternal BMI and age at menarche exceeded 10% missingness.
- Time-varying health conditions (e.g., hypertension) had intermittent missingness between waves.

1.5 Imputation Methods Applied Using Stata

This project was completed using ALSWH data.

a) Mean Imputation (Row Mean)
- Applied to pre-pregnancy BMI (<15% missingness).
- Used the participant’s own available BMI values from other waves to calculate a personalised row mean.
- Observed mean: 24.46 kg/m²; Imputed mean: 24.63 kg/m² — indicating close alignment.

b) Multiple Imputation (MI)
- Applied to age at menarche (>10% missingness), a static variable recorded only once (Survey 2).
- Used Multivariate Imputation by Chained Equations (MICE) with 20 imputations, following guidelines to reduce sampling error.
- Incorporated auxiliary variables predictive of missingness.
- Conducted diagnostics: density plots, distribution overlays, and observed vs. imputed comparison.
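
After multiple imputation, the analysis model is fitted once per imputed dataset and the results are combined with Rubin's rules. The sketch below shows that pooling step in base R for illustration only. It is not the Stata code used in this project (Stata's mi estimate and R's mice::pool() perform this step for you), pool_rubin() is a helper name introduced here, and the estimates and variances are hypothetical numbers.

```r
# Pool m point estimates and their squared standard errors with Rubin's rules
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  q_bar <- mean(estimates)         # pooled point estimate
  w <- mean(variances)             # within-imputation variance
  b <- var(estimates)              # between-imputation variance
  total <- w + (1 + 1 / m) * b     # total variance
  c(estimate = q_bar, within = w, between = b, se = sqrt(total))
}

# Hypothetical results from m = 5 imputed datasets
est <- c(12.9, 13.1, 13.0, 12.8, 13.2)
v   <- c(0.040, 0.042, 0.039, 0.041, 0.040)

pooled <- pool_rubin(est, v)
print(round(pooled, 3))
```

The between-imputation term is what single imputation ignores: it carries the extra uncertainty that comes from not knowing the missing values.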

c) Last Observation Carried Forward (LOCF)
- Used for chronic conditions (e.g., hypertension, diabetes) that persist after diagnosis.
- Ensured stability in prevalence estimates over time.

d) Next Observation Carried Backward (NOCB)
- Applied to lifetime exposure variables (e.g., partner violence). If reported in later waves, status was carried backward.

e) Combined LOCF + NOCB
- Applied to variables requiring lifetime ascertainment, maximising completeness in both time directions; for example, country of birth.

1.6 Sensitivity Analyses

To ensure the robustness of results:
- Conducted complete-case analysis alongside imputed datasets.
- Compared distributions of observed vs. imputed data to detect anomalies.
- Re-estimated models with alternative imputation strategies to check stability of effect estimates.

Table: Summary of Missing Data Handling Techniques

Variable Type	Example Variable	Method Applied	Rationale
Static, continuous	Age at menarche	Multiple Imputation (20 datasets, chained equations)	Single-time measurement; high missingness; categorical imputation adjusted for predictors
Time-varying, continuous	BMI prior to birth	Row mean imputation using available values from same participant	Preserves within-person trajectory; missing at intermittent waves
Chronic binary condition	Hypertension, diabetes	Last Observation Carried Forward (LOCF)	Lifetime conditions assumed persistent once reported
Lifetime event binary	Ever experienced partner violence	LOCF + Next Observation Carried Backward (NOCB)	Ensures capture of lifetime exposure regardless of reporting wave

In summary, the combination of diagnostic mapping, mechanism classification, tailored imputation, and sensitivity analyses allowed me to address missingness in a scientifically sound manner, maximising the analytic value of this 20-year longitudinal dataset.

Final Reflection:
I have had the privilege of contributing to one of Australia’s most significant longitudinal studies, producing results that are both methodologically robust and policy-relevant.
The lessons I have gained about rigour, adaptability, and the potential of data to drive meaningful change will remain central to my work and collaborations. I will apply these skills in future projects that connect advanced analytics with real-world health outcomes, both in Australia and internationally.

If you are interested in exploring each paper in greater depth, including the methods applied, please follow the links below. I hope you find them insightful.

Preterm Birth: https://doi.org/10.1016/j.midw.2022.103334
Low Birthweight: https://doi.org/10.1093/eurpub/ckab033
Caesarean Section and Labour Intervention: https://doi.org/10.1007/s43032-023-01219-7

R-Based Missing Data Management and Analysis Project

Missing Value Identification and Imputations

A. Identifying Missing Values

is.na(), anyNA(), sum(is.na(x)), colSums(is.na(df)) – detect/count missing values
complete.cases() – filter complete rows
Convert blanks to NA: x[x == ""] <- NA

B. Handling Missing Values

Basic Removal/Replacement: na.omit(), replace_na() (tidyr), ifelse()
Forward/Backward Fill: fill() (tidyr) with .direction = "down" / "up"
Statistical Imputation:
- Simple: mean/median imputation
- Multiple: mice::mice()
- Advanced: Hmisc::aregImpute(), missForest, Amelia, zoo::na.fill()

Missing data is a common issue in real-world datasets. In R, handling missing values properly is essential to ensure the integrity of analyses and avoid misleading results.

This section covers how to identify, clean, and impute missing values using base R and tidyverse tools. Techniques range from basic filtering to advanced imputation models.

➡️ Missing values are typically represented as NA in R. Empty strings (e.g., "") or special codes (e.g., -99, "missing") may also indicate missingness and need conversion before analysis.
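
As a minimal sketch of that conversion step, the base R example below recodes blanks and sentinel codes to NA. The data frame, its column names, and the codes "missing" and -99 are hypothetical and only illustrate the idea; real datasets should be recoded according to their own codebook.

```r
# Hypothetical raw data with blanks and special codes standing in for missing
raw <- data.frame(
  passport = c("X123", "", "missing", "X456"),
  income   = c(50000, -99, 42000, NA),
  stringsAsFactors = FALSE
)

# Recode text placeholders to NA
raw$passport[raw$passport %in% c("", "missing")] <- NA

# Recode the numeric sentinel to NA; %in% treats existing NAs as "no match"
raw$income[raw$income %in% -99] <- NA

print(raw)
print(colSums(is.na(raw)))
```

Using %in% rather than == avoids NA values appearing in the logical index when the column already contains some NAs.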

🔹 Summary of Functions for Handling Missing Values

Task	Function / Code	Purpose
Detect missing values	`is.na(x)`, `anyNA(x)`, `sum(is.na(x))`	Identify/count missing values
Missing per column	`colSums(is.na(df))`	Count NAs per column
Filter complete rows	`complete.cases(df)`	Keep only rows without any missing values
Convert blanks to `NA`	`x[x == ""] <- NA`	Convert empty strings to missing values
Remove missing rows	`na.omit(df)`	Remove rows with any missing value
Replace missing values	`replace_na(df, list(...))`, `ifelse(is.na(x), ...)`	Impute or replace NA with custom logic
Forward/backward fill	`fill(df, .direction = "down" / "up")`	Fill NA using nearby values
Mean/median imputation	`x[is.na(x)] <- mean(x, na.rm = TRUE)`	Replace NA with summary statistic
Multiple imputation	`mice::mice(df)`	Create multiple imputed datasets using chained equations
Advanced imputation	`Hmisc::aregImpute`, `missForest`, `Amelia`, `zoo::na.fill()`	Specialized techniques for complex imputations

📚 Further Learning and Resources

Additional tutorials, demonstrations, and datasets are available on:

YouTube Channel – AnalyticsHub
https://studio.youtube.com/channel/UCLVKP0g8GvHhOh0kn20T8eg/videos/upload?filter=%5B%5D&sort=%7B%22columnType%22%3A%22date%22%2C%22sortOrder%22%3A%22DESCENDING%22%7D
GitHub Repository
https://github.com/HabtamuBizuayehu?tab=repositories

                                    id  birthdate deathdate         ssn
1 898a6256-7ffb-dbe2-e24d-12fda2fedcfd 1966-06-23           999-98-5513
2 5e0a6984-38d7-c604-f55f-f1de5e933768 1986-07-04           999-58-5560
3 e335de09-0994-4111-3c15-6edcc17ae4bc 2005-04-25           999-48-3846
4 45b89342-dc05-8e57-8eee-9ed68ec42378 1984-08-24           999-40-3433
5 916b1ac8-56c8-ec1b-3b9a-721336a74912 1953-05-10           999-16-3735
6 dc323bce-e583-d903-303e-9c865bc87e67 2012-09-02           999-90-3605
    drivers   passport prefix     first     middle          last suffix
1 S99971131 X62454905X   Mrs.  Maris768              Lindgren255     NA
2 S99931679 X61479569X   Mrs. Hattie299  Reatha769      Nader710     NA
3 S99958413               Ms.  Jimmie93 Towanda270      Doyle959     NA
4 S99985178 X27670495X    Mr. Dwight645  Marlin805      Hilll811     NA
5 S99968840 X83646024X    Ms.  Tambra47 Brittni468 Balistreri607     NA
6                             Carson894 Raphael767     Littel644     NA
     maiden marital  race   ethnicity gender                       birthplace
1   Dach178       D white nonhispanic      F        Oxford  Massachusetts  US
2 Harber290       M white nonhispanic      F North Reading  Massachusetts  US
3                   white nonhispanic      F        Boston  Massachusetts  US
4                 M white nonhispanic      M    Framingham  Massachusetts  US
5                 S white nonhispanic      F     Lexington  Massachusetts  US
6                   white nonhispanic      M      Plymouth  Massachusetts  US
                        address        city         state           county
1          962 Stiedemann Vista    Aquinnah Massachusetts     Dukes County
2            720 O'Keefe Arcade      Newton Massachusetts Middlesex County
3       493 Cruickshank Mission   Holliston Massachusetts Middlesex County
4              810 Wolf Pathway      Malden Massachusetts Middlesex County
5      146 Rowe Village Suite 3 New Bedford Massachusetts   Bristol County
6 467 Collier Stravenue Unit 12   Attleboro Massachusetts   Bristol County
   fips  zip      lat       lon healthcare_expenses healthcare_coverage income
1    NA    0 41.29618 -70.77458           343608.18           884243.46 178323
2 25017 2468 42.29180 -71.14930           169293.06           801407.90  86384
3    NA    0 42.24196 -71.46440            64178.99             5906.14 198522
4 25017 2148 42.43610 -71.14328            36010.63           472861.90  25909
5 25005 2743 41.69576 -71.01391           281198.90           310261.21 140772
6 44007 2861 41.94870 -71.30420            45972.75                0.00  71169
  visit
1     1
2     1
3     1
4     1
5     1
6     1

# Packages used throughout this section
library(dplyr); library(tidyr); library(mice)

# Summarise missing values across all columns
missing_summary <- colSums(is.na(visits_all))
print(missing_summary[missing_summary > 0])

          deathdate             drivers            passport              prefix 
                 40                   4                  12                   8 
             middle              suffix              maiden             marital 
                  4                  40                  28                  16 
               race                fips                 zip healthcare_coverage 
                 20                  12                  12                   4 
             income 
                 20

# Total number of missing cells
total_missing <- sum(is.na(visits_all))
cat("Total missing values in dataset:", total_missing, "\n")

Total missing values in dataset: 220

# Count of records with at least one missing value
# (complete.cases() works row by row, so this counts visit records, not unique individuals)
n_missing_individuals <- sum(!complete.cases(visits_all))
cat("Number of records with at least one missing value:", n_missing_individuals, "\n")

Number of records with at least one missing value: 40

# Sort records by patient ID and visit number
visits_all <- visits_all %>%
  arrange(id_text, visit)

# View rows with any missing values
missing_rows <- visits_all %>%
  filter(!complete.cases(.))
head(missing_rows, n = 12) %>% select(id_text, race, income, passport)

                id_text  race income   passport
1  Anit|Sánc|1960-05-07 white  61016  X9157439X
2  Anit|Sánc|1960-05-07  <NA>     NA  X9157439X
3  Anit|Sánc|1960-05-07 white     NA  X9157439X
4  Anit|Sánc|1960-05-07  <NA>  60016  X9157439X
5  Aure|Weis|2003-11-12 white   8752 X17586296X
6  Aure|Weis|2003-11-12  <NA>     NA X17586296X
7  Aure|Weis|2003-11-12 white     NA X17586296X
8  Aure|Weis|2003-11-12  <NA>   7752 X17586296X
9  Cars|Litt|2012-09-02 white  71169       <NA>
10 Cars|Litt|2012-09-02  <NA>     NA       <NA>
11 Cars|Litt|2012-09-02  <NA>  70169       <NA>
12 Cars|Litt|2012-09-02 white     NA       <NA>

# Impute passport with "UNKNOWN"
df_passport <- visits_all %>%
  mutate(passport_imputed = replace_na(passport, "UNKNOWN")) %>%
  filter(is.na(passport)) %>%
  select(id, id_text, visit, passport, passport_imputed)

head(df_passport, n = 12)

                                     id              id_text visit passport
1  dc323bce-e583-d903-303e-9c865bc87e67 Cars|Litt|2012-09-02     1     <NA>
2  dc323bce-e583-d903-303e-9c865bc87e67 Cars|Litt|2012-09-02     2     <NA>
3  dc323bce-e583-d903-303e-9c865bc87e67 Cars|Litt|2012-09-02     3     <NA>
4  dc323bce-e583-d903-303e-9c865bc87e67 Cars|Litt|2012-09-02     4     <NA>
5  da7b1f55-c782-544f-ba8c-fe69d519dc85 Doll|Fran|2008-10-04     1     <NA>
6  da7b1f55-c782-544f-ba8c-fe69d519dc85 Doll|Fran|2008-10-04     2     <NA>
7  da7b1f55-c782-544f-ba8c-fe69d519dc85 Doll|Fran|2008-10-04     3     <NA>
8  da7b1f55-c782-544f-ba8c-fe69d519dc85 Doll|Fran|2008-10-04     4     <NA>
9  e335de09-0994-4111-3c15-6edcc17ae4bc Jimm|Doyl|2005-04-25     1     <NA>
10 e335de09-0994-4111-3c15-6edcc17ae4bc Jimm|Doyl|2005-04-25     2     <NA>
11 e335de09-0994-4111-3c15-6edcc17ae4bc Jimm|Doyl|2005-04-25     3     <NA>
12 e335de09-0994-4111-3c15-6edcc17ae4bc Jimm|Doyl|2005-04-25     4     <NA>
   passport_imputed
1           UNKNOWN
2           UNKNOWN
3           UNKNOWN
4           UNKNOWN
5           UNKNOWN
6           UNKNOWN
7           UNKNOWN
8           UNKNOWN
9           UNKNOWN
10          UNKNOWN
11          UNKNOWN
12          UNKNOWN

# Sort records by patient ID and visit number (to prepare for forward/backward fill)
visits_all <- visits_all %>%
  arrange(id_text, visit)

# Preview sorted data (tail)
tail(visits_all %>% select(id_text, visit, race, income), n = 12)

                id_text visit  race income
29 Jimm|Doyl|2005-04-25     1 white 198522
30 Jimm|Doyl|2005-04-25     2  <NA>     NA
31 Jimm|Doyl|2005-04-25     3  <NA> 197522
32 Jimm|Doyl|2005-04-25     4 white     NA
33 Mari|Lind|1966-06-23     1 white 178323
34 Mari|Lind|1966-06-23     2  <NA>     NA
35 Mari|Lind|1966-06-23     3 white     NA
36 Mari|Lind|1966-06-23     4  <NA> 177323
37 Tamb|Bali|1953-05-10     1 white 140772
38 Tamb|Bali|1953-05-10     2  <NA>     NA
39 Tamb|Bali|1953-05-10     3 white     NA
40 Tamb|Bali|1953-05-10     4  <NA> 139772

# Forward fill missing race values within each patient (downward)
visits_all <- visits_all %>%
  group_by(id_text) %>%
  mutate(race_forward = race) %>%
  fill(race_forward, .direction = "down") %>%
  ungroup()

# Preview forward-filled race values (head)
head(visits_all %>% select(id_text, visit, race, race_forward), n = 12)

# A tibble: 12 × 4
   id_text              visit race  race_forward
   <chr>                <dbl> <chr> <chr>       
 1 Anit|Sánc|1960-05-07     1 white white       
 2 Anit|Sánc|1960-05-07     2 <NA>  white       
 3 Anit|Sánc|1960-05-07     3 white white       
 4 Anit|Sánc|1960-05-07     4 <NA>  white       
 5 Aure|Weis|2003-11-12     1 white white       
 6 Aure|Weis|2003-11-12     2 <NA>  white       
 7 Aure|Weis|2003-11-12     3 white white       
 8 Aure|Weis|2003-11-12     4 <NA>  white       
 9 Cars|Litt|2012-09-02     1 white white       
10 Cars|Litt|2012-09-02     2 <NA>  white       
11 Cars|Litt|2012-09-02     3 <NA>  white       
12 Cars|Litt|2012-09-02     4 white white

# Backward fill missing race values within each patient (upward)
visits_all <- visits_all %>%
  group_by(id_text) %>%
  mutate(race_backward = race) %>%
  fill(race_backward, .direction = "up") %>%
  ungroup()

# Preview backward-filled race values (tail)
tail(visits_all %>% select(id_text, visit, race, race_backward), n = 12)

# A tibble: 12 × 4
   id_text              visit race  race_backward
   <chr>                <dbl> <chr> <chr>        
 1 Jimm|Doyl|2005-04-25     1 white white        
 2 Jimm|Doyl|2005-04-25     2 <NA>  white        
 3 Jimm|Doyl|2005-04-25     3 <NA>  white        
 4 Jimm|Doyl|2005-04-25     4 white white        
 5 Mari|Lind|1966-06-23     1 white white        
 6 Mari|Lind|1966-06-23     2 <NA>  white        
 7 Mari|Lind|1966-06-23     3 white white        
 8 Mari|Lind|1966-06-23     4 <NA>  <NA>         
 9 Tamb|Bali|1953-05-10     1 white white        
10 Tamb|Bali|1953-05-10     2 <NA>  white        
11 Tamb|Bali|1953-05-10     3 white white        
12 Tamb|Bali|1953-05-10     4 <NA>  <NA>
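
In the backward-filled output above, a participant's final visit stays NA when no later value exists to carry back. Combining both directions closes such gaps. The sketch below shows the idea in base R on hypothetical toy data; locf() and nocb() are helper names introduced here, not functions from a package.

```r
# Last observation carried forward: leading NAs stay NA
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # position of the most recent non-missing value
  c(NA, x[!is.na(x)])[idx + 1]
}

# Next observation carried backward: LOCF applied to the reversed vector
nocb <- function(x) rev(locf(rev(x)))

# Hypothetical visits for two participants
visits <- data.frame(
  id    = rep(c("A", "B"), each = 4),
  visit = rep(1:4, times = 2),
  race  = c("white", NA, "white", NA, NA, "asian", NA, NA),
  stringsAsFactors = FALSE
)

# Sort first so that "forward" and "backward" follow visit order
visits <- visits[order(visits$id, visits$visit), ]

# Fill within each participant: forward first, then backward for leading gaps
visits$race_locf <- ave(visits$race, visits$id, FUN = locf)
visits$race_both <- ave(visits$race_locf, visits$id, FUN = nocb)

print(visits)
```

With tidyr, fill() also accepts .direction = "downup", which performs the same forward-then-backward fill in one call. This is only appropriate for values that do not change over time.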

# Calculate the global (overall) mean income
global_mean_income <- mean(visits_all$income, na.rm = TRUE)

# Use global mean to impute missing income values
visits_all <- visits_all %>%
  mutate(income_global_mean = ifelse(is.na(income), global_mean_income, income))

# Preview global mean-imputed income (head)
head(visits_all %>% select(id_text, visit, income, income_global_mean), n = 12)

# A tibble: 12 × 4
   id_text              visit income income_global_mean
   <chr>                <dbl>  <dbl>              <dbl>
 1 Anit|Sánc|1960-05-07     1  61016             61016 
 2 Anit|Sánc|1960-05-07     2     NA             96123.
 3 Anit|Sánc|1960-05-07     3     NA             96123.
 4 Anit|Sánc|1960-05-07     4  60016             60016 
 5 Aure|Weis|2003-11-12     1   8752              8752 
 6 Aure|Weis|2003-11-12     2     NA             96123.
 7 Aure|Weis|2003-11-12     3     NA             96123.
 8 Aure|Weis|2003-11-12     4   7752              7752 
 9 Cars|Litt|2012-09-02     1  71169             71169 
10 Cars|Litt|2012-09-02     2     NA             96123.
11 Cars|Litt|2012-09-02     3  70169             70169 
12 Cars|Litt|2012-09-02     4     NA             96123.
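
One caveat with global mean imputation is that it leaves the mean unchanged but shrinks the spread, because every imputed value sits exactly at the centre of the distribution. The sketch below illustrates this on simulated data; the income values and the 30% missingness rate are hypothetical.

```r
set.seed(1)

# Simulated (hypothetical) incomes with 60 of 200 values set to missing
income_sim <- rnorm(200, mean = 90000, sd = 30000)
income_sim[sample(200, 60)] <- NA

# Global mean imputation, as in the step above
income_imputed <- ifelse(is.na(income_sim),
                         mean(income_sim, na.rm = TRUE),
                         income_sim)

# Compare the mean and standard deviation before and after imputation
comparison <- rbind(
  observed = c(mean = mean(income_sim, na.rm = TRUE), sd = sd(income_sim, na.rm = TRUE)),
  imputed  = c(mean = mean(income_imputed), sd = sd(income_imputed))
)
print(round(comparison))
```

The reduced standard deviation is why mean imputation tends to understate uncertainty, and why it is best kept for low levels of missingness or for descriptive checks.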

# Calculate mean income per participant (id_text)
participant_mean_income <- visits_all %>%
  group_by(id_text) %>%
  summarise(mean_income_id = mean(income, na.rm = TRUE), .groups = "drop")

# Use participant-level mean to impute missing income values
visits_all <- visits_all %>%
  left_join(participant_mean_income, by = "id_text") %>%
  mutate(income_participant_mean = ifelse(is.na(income), mean_income_id, income))

# Preview participant mean-imputed income (tail)
tail(visits_all %>% select(id_text, visit, income, income_participant_mean), n = 12)

# A tibble: 12 × 4
   id_text              visit income income_participant_mean
   <chr>                <dbl>  <dbl>                   <dbl>
 1 Jimm|Doyl|2005-04-25     1 198522                  198522
 2 Jimm|Doyl|2005-04-25     2     NA                  198022
 3 Jimm|Doyl|2005-04-25     3 197522                  197522
 4 Jimm|Doyl|2005-04-25     4     NA                  198022
 5 Mari|Lind|1966-06-23     1 178323                  178323
 6 Mari|Lind|1966-06-23     2     NA                  177823
 7 Mari|Lind|1966-06-23     3     NA                  177823
 8 Mari|Lind|1966-06-23     4 177323                  177323
 9 Tamb|Bali|1953-05-10     1 140772                  140772
10 Tamb|Bali|1953-05-10     2     NA                  140272
11 Tamb|Bali|1953-05-10     3     NA                  140272
12 Tamb|Bali|1953-05-10     4 139772                  139772

# Multiple imputation for fips, zip, healthcare_coverage, and income
patients_mice <- visits_all %>%
  select(id_text, fips, zip, healthcare_coverage, income)

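# Apply mice with predictive mean matching (pmm).
# Note: m = 1 returns a single completed dataset, used here for illustration only;
# for inference, set m > 1 (e.g. m = 20) and combine model results with pool().
imputed <- mice(patients_mice %>% select(-id_text), method = "pmm", m = 1, printFlag = FALSE)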

# Retrieve completed dataset
mice_result <- complete(imputed)
colnames(mice_result) <- paste0(colnames(mice_result), "_imputed")

# Join ID and filter rows with original missing values
df_mice <- bind_cols(patients_mice, mice_result) %>%
  filter(is.na(fips) | is.na(zip) | is.na(healthcare_coverage) | is.na(income)) %>%
  select(id_text,
         fips, fips_imputed,
         zip, zip_imputed,
         healthcare_coverage, healthcare_coverage_imputed,
         income, income_imputed)

head(df_mice, n = 12)

# A tibble: 12 × 9
   id_text               fips fips_imputed   zip zip_imputed healthcare_coverage
   <chr>                <int>        <int> <int>       <int>               <dbl>
 1 Anit|Sánc|1960-05-07 25009        25009  1915        1915            1280070.
 2 Anit|Sánc|1960-05-07 25009        25009  1915        1915            1280070.
 3 Aure|Weis|2003-11-12 25027        25027  1606        1606             253645.
 4 Aure|Weis|2003-11-12 25027        25027  1606        1606             253645.
 5 Cars|Litt|2012-09-02 44007        44007  2861        2861                 NA 
 6 Cars|Litt|2012-09-02 44007        44007  2861        2861                 NA 
 7 Cars|Litt|2012-09-02 44007        44007  2861        2861                 NA 
 8 Cars|Litt|2012-09-02 44007        44007  2861        2861                 NA 
 9 Doll|Fran|2008-10-04    NA        25013    NA        2468             422516.
10 Doll|Fran|2008-10-04    NA        25027    NA        1030             422516.
11 Doll|Fran|2008-10-04    NA        25017    NA        2861             422516.
12 Doll|Fran|2008-10-04    NA        25017    NA        1606             422516.
# ℹ 3 more variables: healthcare_coverage_imputed <dbl>, income <dbl>,
#   income_imputed <dbl>

# Optional: Inspect logged events from mice
imputed$loggedEvents

  it im                 dep meth
1  3  1 healthcare_coverage  pmm
                                                                                                                                                                                                                                                       out
1 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?

# Final summary: compare original and imputed values for rows with missingness

final_impute_check <- visits_all %>%
  filter(is.na(race) | is.na(income)) %>%
  select(id_text, visit,
         race, race_forward, race_backward,
         income, income_global_mean, income_participant_mean)

# Preview top 12
head(final_impute_check, n = 12)

# A tibble: 12 × 8
   id_text      visit race  race_forward race_backward income income_global_mean
   <chr>        <dbl> <chr> <chr>        <chr>          <dbl>              <dbl>
 1 Anit|Sánc|1…     2 <NA>  white        white             NA             96123.
 2 Anit|Sánc|1…     3 white white        white             NA             96123.
 3 Anit|Sánc|1…     4 <NA>  white        <NA>           60016             60016 
 4 Aure|Weis|2…     2 <NA>  white        white             NA             96123.
 5 Aure|Weis|2…     3 white white        white             NA             96123.
 6 Aure|Weis|2…     4 <NA>  white        <NA>            7752              7752 
 7 Cars|Litt|2…     2 <NA>  white        white             NA             96123.
 8 Cars|Litt|2…     3 <NA>  white        white          70169             70169 
 9 Cars|Litt|2…     4 white white        white             NA             96123.
10 Doll|Fran|2…     2 <NA>  white        white             NA             96123.
11 Doll|Fran|2…     3 <NA>  white        white         187023            187023 
12 Doll|Fran|2…     4 white white        white             NA             96123.
# ℹ 1 more variable: income_participant_mean <dbl>

# Preview bottom 12
tail(final_impute_check, n = 12)

# A tibble: 12 × 8
   id_text      visit race  race_forward race_backward income income_global_mean
   <chr>        <dbl> <chr> <chr>        <chr>          <dbl>              <dbl>
 1 Home|Quiñ|1…     2 <NA>  white        white             NA             96123.
 2 Home|Quiñ|1…     3 <NA>  white        white           6361              6361 
 3 Home|Quiñ|1…     4 white white        white             NA             96123.
 4 Jimm|Doyl|2…     2 <NA>  white        white             NA             96123.
 5 Jimm|Doyl|2…     3 <NA>  white        white         197522            197522 
 6 Jimm|Doyl|2…     4 white white        white             NA             96123.
 7 Mari|Lind|1…     2 <NA>  white        white             NA             96123.
 8 Mari|Lind|1…     3 white white        white             NA             96123.
 9 Mari|Lind|1…     4 <NA>  white        <NA>          177323            177323 
10 Tamb|Bali|1…     2 <NA>  white        white             NA             96123.
11 Tamb|Bali|1…     3 white white        white             NA             96123.
12 Tamb|Bali|1…     4 <NA>  white        <NA>          139772            139772 
# ℹ 1 more variable: income_participant_mean <dbl>

# Clean up temporary objects
rm(participant_mean_income, final_impute_check, global_mean_income, df_passport, df_mice,
   patients_mice, imputed, mice_result, missing_summary, total_missing, n_missing_individuals,
   missing_rows)

Conclusion

Managing missing data is a core part of ensuring the validity, reproducibility, and interpretability of findings. In multi-decade datasets, missingness arises from participant attrition, evolving survey designs, structural inconsistencies, and changes in data collection priorities, and each source requires a tailored approach. Without a structured approach, these gaps can bias results, reduce statistical power, and undermine the credibility of findings.

An effective missing data strategy should:

- Begin with comprehensive diagnostics to assess the extent, patterns, and potential mechanisms of missingness, whether MCAR (Missing Completely at Random), MAR (Missing at Random), or MNAR (Missing Not at Random), as this classification guides the choice of analytical remedies.
- Use appropriate imputation techniques, from simple methods such as mean substitution and last observation carried forward (LOCF) to advanced approaches such as multiple imputation, or combined strategies, selected according to the type of variable and the nature of missingness.
- Incorporate sensitivity analyses to evaluate the robustness of results under different assumptions.
- Maintain transparent documentation of all decisions, code, and outputs, ensuring reproducibility and facilitating peer review.

Ultimately, the goal is not to replace missing values simply to achieve a complete dataset, but to apply appropriate methods that preserve the validity, reliability, and interpretability of the analysis. A structured, documented, and flexible approach allows data analysts to maximise the value of data while minimising bias and loss of precision—ensuring that findings can confidently and precisely inform both science and policy.

References

A. Books and Articles

Dong Y, Peng C-YJ. Principled missing data methods for researchers. SpringerPlus. 2013;2(1):222.
Bennett DA. How can I deal with missing data in my study? Australian and New Zealand Journal of Public Health. 2001;25(5):464-9.
StataCorp. Stata Statistical Software: Release 13. College Station, TX: StataCorp LP; 2013.
White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med. 2011;30(4):377-99. https://www.sagepub.com/sites/default/files/upm-binaries/45664_6.pdf

B. Useful websites:
https://missingdata.org/
https://www.missingdata.nl/missing-data/missing-data-methods/