Assessing Covariates Balance

Detailled Script.

Tarik Benmarhnia https://profiles.ucsd.edu/tarik.benmarhnia (UCSD & Scripps Institute)https://benmarhniaresearch.ucsd.edu/ , Marie-Abèle Bind https://scholar.harvard.edu/marie-abele (Biostatistics Center, Massachusetts General Hospital)https://biostatistics.massgeneral.org/faculty/marie-abele-bind-phd/ , Léo Zabrocki https://lzabrocki.github.io/ (Paris School of Economics)https://www.parisschoolofeconomics.eu/fr/zabrocki-leo/
2022-07-29

In this document, we provide all steps and R codes to evaluate if days with heat wave are similar to days without heat wave for a set of confounding factors. Should you have any questions, need help to reproduce the analysis or find coding errors, please do not hesitate to contact us at .

Required Packages

To reproduce exactly the eda_covariates_balance.html document, we first need to have installed:

Once everything is set up, we load the following packages:

# load required packages
library(knitr) # for creating the R Markdown document
library(here) # for files paths organization
library(tidyverse) # for data manipulation and visualization
library(broom) # for cleaning regression outputs
library(Cairo) # for printing custom police of graphs
library(DT) # for displaying the data as tables

We finally load our custom ggplot2 theme for graphs:

# load ggplot custom theme
source(here::here("inputs", "2.functions",
                  "script_theme_tufte.R"))
# define nice colors
my_blue <- "#0081a7"
my_orange <- "#fb8500"

Checking Covariates Balance

We load the environmental data:

# load the data
data <-
  readRDS(here::here("inputs", "1.data", "environmental_data.rds")) %>%
  # we recode the heat_wave variable
  mutate(heat_wave = ifelse(heat_wave == 1, "Days with Heat Wave", "Days without Heat Wave"))

Continuous Covariates

We first explore whether the relative humidity measured in \(t\) and the lags of O\(_{3}\) and NO\(_{2}\) concentrations are balanced. We plot below the density distribution of each covariate by treatment group:

Please show me the code!
# make the graph
graph_continuous_cov_densities <- data %>%
  # pivot covariates to long format
  pivot_longer(
    cols = c(humidity_relative, o3_lag_1:o3_lag_3, no2_lag_1:no2_lag_3),
    names_to = "covariate",
    values_to = "value"
  ) %>%
  # change covariate names
  mutate(
    covariate = case_when(
      covariate == "humidity_relative" ~ "Relative Humidity (%)",
      covariate == "o3_lag_1" ~ "O3 in t-1 (µg/m³)",
      covariate == "o3_lag_2" ~ "O3 in t-2 (µg/m³)",
      covariate == "o3_lag_3" ~ "O3 in t-3 (µg/m³)",
      covariate == "no2_lag_1" ~ "NO2 in t-1 (µg/m³)",
      covariate == "no2_lag_2" ~ "NO2 in t-2 (µg/m³)",
      covariate == "no2_lag_3" ~ "NO2 in t-3 (µg/m³)"
    )
  ) %>%
  # reorder covariates
  mutate(
    covariate = fct_relevel(
      covariate,
      "Relative Humidity (%)",
      "O3 in t-1 (µg/m³)",
      "O3 in t-2 (µg/m³)",
      "O3 in t-3 (µg/m³)",
      "NO2 in t-1 (µg/m³)",
      "NO2 in t-2 (µg/m³)",
      "NO2 in t-3 (µg/m³)"
    )
  ) %>%
  # make density graph
  ggplot(., aes(x = value,
                color = fct_rev(heat_wave))) +
  geom_density() +
  scale_color_manual(name = "Group:", values = c(my_blue, my_orange)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  facet_wrap(~ covariate, scales = "free") +
  xlab("Covariate Value") + ylab("Density") +
  ggtitle("Density Distribution of Continuous Covariates by Treatment") + 
  theme_tufte() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank())

# display the graph
graph_continuous_cov_densities
Please show me the code!
# save the graph
ggsave(
  graph_continuous_cov_densities + labs(title = NULL),
  filename = here::here("inputs", "3.outputs", "2.graphs", "graph_continuous_cov_densities.pdf"),
  width = 20,
  height = 15,
  units = "cm",
  device = cairo_pdf
)

On this graph, we can see that the relative humidity and the lags of O\(_{3}\) are imbalanced across the treatment and control groups. It is less the case for NO\(_{2}\). As an alternative to density distributions, we can summarize the imbalance by computing, for each covariate, the absolute standardized mean difference between treatment and control groups. The absolute standardized mean difference of a covariate is just the absolute value of the difference in means between treated and control units divided by the standard deviation of the treatment group. We can simply compute and plot this metric using the following code:

Please show me the code!
# reshape the data into long
data_continuous_cov <- data %>%
  dplyr::select(heat_wave, humidity_relative, o3_lag_1:o3_lag_3, no2_lag_1:no2_lag_3) %>%
  pivot_longer(cols = -c(heat_wave),
               names_to = "variable",
               values_to = "value") %>%
  mutate(
    covariate_name = NA %>%
      ifelse(str_detect(variable, "o3"), "O3", .) %>%
      ifelse(
        str_detect(variable, "humidity_relative"),
        "Relative Humidity",
        .
      ) %>%
      ifelse(str_detect(variable, "no2"), "NO2", .)
  ) %>%
  mutate(
    time = "Lag 0" %>%
      ifelse(str_detect(variable, "lag_1"), "Lag 1", .) %>%
      ifelse(str_detect(variable, "lag_2"), "Lag 2", .) %>%
      ifelse(str_detect(variable, "lag_3"), "Lag 3", .)
  ) %>%
  mutate(time = fct_relevel(time, "Lag 3", "Lag 2", "Lag 1", "Lag 0")) %>%
  dplyr::select(heat_wave, covariate_name, time, value)

# compute absolute difference in  means
data_abs_difference <- data_continuous_cov %>%
  group_by(covariate_name, time, heat_wave) %>%
  summarise(mean_value = mean(value, na.rm = TRUE)) %>%
  summarise(abs_difference = abs(mean_value[2] - mean_value[1]))

# compute treatment covariates standard deviation
data_sd <-  data_continuous_cov %>%
  filter(heat_wave == "Days with Heat Wave") %>%
  group_by(covariate_name, time, heat_wave) %>%
  summarise(sd_treatment = sd(value, na.rm = TRUE)) %>%
  ungroup() %>%
  dplyr::select(covariate_name, time, sd_treatment)

# compute standardized differences
data_standardized_difference <-
  left_join(data_abs_difference, data_sd, by = c("covariate_name", "time")) %>%
  mutate(standardized_difference = abs_difference / sd_treatment) %>%
  dplyr::select(-c(abs_difference, sd_treatment))

# make the graph
graph_std_diff_continuous_cov <- ggplot(data_standardized_difference, aes(y = covariate_name, x = standardized_difference)) +
  geom_vline(xintercept = 0, size = 0.3) +
  geom_vline(xintercept = 0.1, color = my_orange) +
  geom_point(size = 2, color = my_blue) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  facet_wrap(~ fct_rev(time), nrow = 1) +
  xlab("Standardized Mean Differences") +
  ylab("") + 
  ggtitle("Standardized Mean Differences by Covariate") +
  theme_tufte()

# display the graph
graph_std_diff_continuous_cov
Please show me the code!
# save the graph
ggsave(
  graph_std_diff_continuous_cov + labs(title = NULL),
  filename = here::here("inputs", "3.outputs", "2.graphs", "graph_std_diff_continuous_cov.pdf"),
  width = 20,
  height = 6,
  units = "cm",
  device = cairo_pdf
)

On this graph, the black line represents standardized mean differences equal to 0 and the orange line is the 0.1 threshold often used in the matching literature to assess balance. Standardized mean differences below this threshold would indicate good balance. Here, for all covariates and lags, the treatment and control groups are imbalanced.

We display in the table below the values of the standardized mean differences by covariates and lags:

Please show me the code!
data_standardized_difference %>%
  mutate(standardized_difference = round(standardized_difference, 2)) %>%
  arrange(fct_rev(time)) %>%
  rename("Covariate" = "covariate_name", "Time" = "time", "Standardized Mean Difference" = "standardized_difference") %>%
  kable(., align = c("l", "l", "c"))
Covariate Time Standardized Mean Difference
Relative Humidity Lag 0 0.67
NO2 Lag 1 0.35
O3 Lag 1 1.01
NO2 Lag 2 0.26
O3 Lag 2 0.72
NO2 Lag 3 0.15
O3 Lag 3 0.41

Categorical Covariates

For calendar variables such as the day of the week, the month and the year, we evaluate balance by plotting the proportions of days with and without heat wave. If heat waves were randomly distributed, there should not be difference in the distribution of the proportions for the two groups. We first plot the distribution of proportions for the day of the week:

Please show me the code!
# compute the proportions of observations belonging to each wday by treatment status
data_weekday <- data %>%
  dplyr::select(weekday, heat_wave) %>%
  mutate(weekday = str_to_title(weekday)) %>%
  pivot_longer(.,-heat_wave) %>%
  group_by(name, heat_wave, value) %>%
  summarise(n = n()) %>%
  mutate(proportion = round(n / sum(n) * 100, 0)) %>%
  ungroup() %>%
  mutate(
    value = fct_relevel(
      value,
      "Monday",
      "Tuesday",
      "Wednesday",
      "Thursday",
      "Friday",
      "Saturday",
      "Sunday"
    )
  )

# make a dots graph
graph_weekday_balance <- ggplot(data_weekday,
                                aes(
                                  x = as.factor(value),
                                  y = proportion,
                                  colour = heat_wave,
                                  group = heat_wave
                                )) +
  geom_line(size = 0.5, alpha = 0.3) +
  geom_point(size = 2) +
  scale_colour_manual(values = c(my_blue, my_orange),
                      guide = guide_legend(reverse = FALSE)) +
  ggtitle("Proportion of Days with and without Heat Waves by Day of the Week") +
  ylab("Proportion (%)") +
  xlab("Day of the Week") +
  labs(colour = "Group:") +
  theme_tufte() +
  theme(
    legend.position = "top",
    legend.justification = "left",
    legend.direction = "horizontal"
  )

# display the graph
graph_weekday_balance
Please show me the code!
# save the graph
ggsave(
  graph_weekday_balance + labs(title = NULL),
  filename = here::here("inputs", "3.outputs", "2.graphs", "graph_weekday_balance.pdf"),
  width = 16,
  height = 9,
  units = "cm",
  device = cairo_pdf
)

On this graph, we can see that there are some differences (in percentage points) in the distribution of units between the two groups across days of the week. The differences are however small—at most 4 percentages points. We then plot the same graph but for the month indicator:

Please show me the code!
# compute the proportions of observations belonging to each month by treatment status
data_month <- data %>%
  dplyr::select(month, heat_wave) %>%
  mutate(month = str_to_title(month)) %>%
  mutate(month = fct_relevel(month,
                             "June",
                             "July",
                             "August")) %>%
  pivot_longer(., -heat_wave) %>%
  group_by(name, heat_wave, value) %>%
  summarise(n = n()) %>%
  mutate(proportion = round(n / sum(n) * 100, 0)) %>%
  ungroup()

# make a dots graph
graph_month_balance <- ggplot(data_month,
                              aes(
                                x = as.factor(value),
                                y = proportion,
                                colour = heat_wave,
                                group = heat_wave
                              )) +
  geom_line(size = 0.5, alpha = 0.3) +
  geom_point(size = 2) +
  scale_colour_manual(values = c(my_blue, my_orange),
                      guide = guide_legend(reverse = FALSE)) +
  ggtitle("Proportion of Days with and without Heat Waves by Month") +
  ylab("Proportion (%)") +
  xlab("Month") +
  labs(colour = "Group:") +
  theme_tufte()

# display the graph
graph_month_balance
Please show me the code!
# save the graph
ggsave(
  graph_month_balance + labs(title = NULL),
  filename = here::here("inputs", "3.outputs", "2.graphs", "graph_month_balance.pdf"),
  width = 15,
  height = 8,
  units = "cm",
  device = cairo_pdf
)

We also plot the same graph but for the year variable:

Please show me the code!
# compute the proportions of observations belonging to each year by treatment status
data_year <- data %>%
  dplyr::select(year, heat_wave) %>%
  pivot_longer(.,-heat_wave) %>%
  group_by(name, heat_wave, value) %>%
  summarise(n = n()) %>%
  mutate(proportion = round(n / sum(n) * 100, 0)) %>%
  ungroup()

# make dots plot
graph_year_balance <- ggplot(data_year,
         aes(
           x = as.factor(value),
           y = proportion,
           colour = heat_wave,
           group = heat_wave
         )) +
  geom_line(size = 0.5, alpha = 0.3) +
  geom_point(size = 2) +
  scale_colour_manual(values = c(my_blue, my_orange),
                      guide = guide_legend(reverse = FALSE)) +
  ggtitle("Proportion of Days with and without Heat Waves by Year") +
  ylab("Proportion (%)") +
  xlab("Year") +
  labs(colour = "Group:") +
  theme_tufte()

# display the graph
graph_year_balance
Please show me the code!
# save the graph
ggsave(
  graph_year_balance + labs(title = NULL),
  filename = here::here("inputs", "3.outputs", "2.graphs", "graph_year_balance.pdf"),
  width = 15,
  height = 8,
  units = "cm",
  device = cairo_pdf
)

Not surprisingly, we can see on this graph that there were more heat waves on specific years.

To summarize the imbalance for calendar variables, we can finally compute the difference of proportion (in percentage points) between days with and without heat waves. We compute these differences with the following code:

# compute differences in proportion
data_calendar_difference <- data %>%
  dplyr::select(heat_wave, weekday, month, year) %>%
  mutate_all( ~ as.character(.)) %>%
  pivot_longer(cols = -c(heat_wave),
               names_to = "variable",
               values_to = "value") %>%
  mutate(value = str_to_title(value)) %>%
  # group by is_treated, variable and values
  group_by(heat_wave, variable, value) %>%
  # compute the number of observations
  summarise(n = n()) %>%
  # compute the proportion
  mutate(freq = round(n / sum(n) * 100, 0)) %>%
  ungroup() %>%
  mutate(
    calendar_variable = NA %>%
      ifelse(str_detect(variable, "weekday"), "Day of the Week", .) %>%
      ifelse(str_detect(variable, "month"), "Month", .) %>%
      ifelse(str_detect(variable, "year"), "Year", .)
  ) %>%
  dplyr::select(heat_wave, calendar_variable, value, freq) %>%
  pivot_wider(names_from = heat_wave, values_from = freq) %>%
  mutate(abs_difference = abs(`Days with Heat Wave` - `Days without Heat Wave`)) %>%
  # reoder the values of variable for the graph
  mutate(
    value = fct_relevel(
      value,
      "Monday",
      "Tuesday",
      "Wednesday",
      "Thursday",
      "Friday",
      "Saturday",
      "Sunday",
      "June",
      "July",
      "August"
    )
  )

We plot below the differences in proportion for each calendar indicator:

Please show me the code!
# plot the differences in proportion for each calendar indicator
graph_all_calendar_balance <-
  ggplot(data_calendar_difference, aes(x = value, y = abs_difference)) +
  geom_segment(aes(
    x = value,
    xend = value,
    y = 0,
    yend = abs_difference
  ), size = 0.3) +
  geom_point(colour = my_blue, size = 3) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 8)) +
  facet_wrap( ~ calendar_variable, scales = "free_x", ncol = 1) +
  ggtitle(
    "Absolute Difference in Calendar Indicators Distribution\nBetween Days with and without Heat Waves"
  ) +
  xlab("Calendar Indicator") + ylab("Absolute Difference\n(Percentage Points)") +
  theme_tufte()

# display the graph
graph_all_calendar_balance
Please show me the code!
# save the graph
ggsave(
  graph_all_calendar_balance + labs(title = NULL),
  filename = here::here("inputs", "3.outputs", "2.graphs", "graph_all_calendar_balance.pdf"),
  width = 18,
  height = 15,
  units = "cm",
  device = cairo_pdf
)

We display in the table below the values of the standardized mean differences by covariates and lags:

Please show me the code!
data_calendar_difference %>%
  dplyr::select(calendar_variable, value, abs_difference) %>%
  rename("Calendard Variable" = "calendar_variable", "Value" = "value", "Absolue Difference in % Points" = "abs_difference") %>%
  kable(., align = c("l", "l", "c"))
Calendard Variable Value Absolue Difference in % Points
Month August 5
Month July 6
Month June 1
Day of the Week Friday 3
Day of the Week Monday 3
Day of the Week Saturday 0
Day of the Week Sunday 0
Day of the Week Thursday 2
Day of the Week Tuesday 4
Day of the Week Wednesday 4
Year 1990 2
Year 1991 2
Year 1993 1
Year 1994 3
Year 1995 5
Year 1996 5
Year 1997 4
Year 1998 1
Year 1999 6
Year 2000 5
Year 2001 10
Year 2002 5
Year 2003 0
Year 2004 6
Year 2005 11
Year 2006 2
Year 2007 1
Year 1992 NA

Lack of Common Support

To illustrate the issue of a lack of common support, we plot the YLL against the concentration of ozone in \(t-1\) and color the points according to the heat wave status:

Please show me the code!
# illustrate lack of support
graph_lack_support_1 <- ggplot(data, aes(x = o3_lag_1, y = yll, colour = heat_wave)) +
  geom_point(shape = 16, size = 0.8, alpha = 0.7) +
  scale_colour_manual(values = c(my_orange, my_blue)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  labs(colour = "Group:") +
  ggtitle("Lack of Support? YLL vs. Ozone in t-1") +
  xlab("O3 in t-1 (µg/m³)") + ylab("YLL") +
  theme_tufte() +
  guides(colour = guide_legend(override.aes = list(size = 3)))

# display the graph
graph_lack_support_1
Please show me the code!
# save the graph
ggsave(
  graph_lack_support_1 + labs(title = NULL),
  filename = here::here("inputs", "3.outputs", "2.graphs", "graph_lack_support_1.pdf"),
  width = 14,
  height = 12,
  units = "cm",
  device = cairo_pdf
)

We can see on this graph that some days without heat waves do not have any similar days with heatwaves in terms of ozone concentrations in \(t-1\). We can also reproduce this figure by year:

Please show me the code!
# illustrate lack of support by year
graph_lack_support_2 <- ggplot(data, aes(x = o3_lag_1, y = yll, colour = heat_wave)) +
  geom_point(shape = 16, alpha = 0.5) +
  scale_colour_manual(values = c(my_orange, my_blue)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  facet_wrap(~ year) + 
  labs(colour = "Group:") +
  ggtitle("Lack of Support? YLL vs. Ozone in t-1 by Year") +
  xlab("O3 in t-1 (µg/m³)") + ylab("YLL") +
  theme_tufte()

# display the graph
graph_lack_support_2
Please show me the code!
# save the graph
ggsave(
  graph_lack_support_2 + labs(title = NULL),
  filename = here::here("inputs", "3.outputs", "2.graphs", "graph_lack_support_2.pdf"),
  width = 30,
  height = 20,
  units = "cm",
  device = cairo_pdf
)

Again, there is a clear lack of common support within each year.

Finally, we can also visualize the imbalance and lack of common support in the data by predicting, using a simple logistic model, the probability of an heat wave to occur for each day.

Please show me the code!
# predict propensity scores
logit_model <- data %>%
  mutate(heat_wave = ifelse(heat_wave == "Days without Heat Wave", 0, 1)) %>%
  glm(
    heat_wave ~ heat_wave_lag_1 + heat_wave_lag_2 + heat_wave_lag_3 +
      humidity_relative +
      o3_lag_1 + o3_lag_2 + o3_lag_3 +
      no2_lag_1 + no2_lag_2 + no2_lag_3 +
      weekend + month + year,
    family = "binomial",
    data = .
  )

# add predicted probabilities
data <- broom::augment(x = logit_model,
                       newdata = data,
                       type.predict = "response")

# create the graph
graph_lack_support_3 <- data %>%
  ggplot(., aes(x = .fitted, colour = heat_wave)) +
  xlim(0, 1) +
  geom_density() +
  scale_colour_manual(values = c(my_orange, my_blue)) +
  labs(colour = "Group:") +
  ggtitle("Lack of Support? Propensity Score Distributions") +
  xlab("Predicted Probability of Treatment") + ylab("Density") +
  theme_tufte()

# display the graph
graph_lack_support_3
Please show me the code!
# save the graph
ggsave(
  graph_lack_support_3 + labs(title = NULL),
  filename = here::here("inputs", "3.outputs", "2.graphs", "graph_lack_support_3.pdf"),
  width = 14,
  height = 10,
  units = "cm",
  device = cairo_pdf
)

We can see on this graph that the two density distributions do not overlap. We display below the summary statistics of the two distributions:

Please show me the code!
data %>%
  group_by(heat_wave) %>%
  summarise(
    Mean = mean(.fitted),
    SD = sd(.fitted),
    Min = min(.fitted),
    Max = max(.fitted)
  ) %>%
  mutate_at(vars(-heat_wave), ~ round(., 2)) %>%
  kable(., align = c("l", "c", "c", "c"))
heat_wave Mean SD Min Max
Days with Heat Wave 0.31 0.24 0.01 0.90
Days without Heat Wave 0.07 0.11 0.00 0.81

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.