Comparing days with Wind Blowing from the North-East to Other Directions. Adjusting for calendar indicators and other weather variables.
In this document, we provide all steps required to reproduce our matching procedure. We compare days where:
We adjust for calendar indicators and weather confouding factors.
Should you have any questions, need help to reproduce the analysis or find coding errors, please do not hesitate to contact me at leo.zabrocki@psemail.eu
To reproduce exactly the script_matching_procedure.html
document, we first need to have installed:
script_matching_procedure.Rmd
file and interact with the R code chunksOnce everything is set up, we have to load the following packages:
# load required packages
library(knitr) # for creating the R Markdown document
library(here) # for files paths organization
library(tidyverse) # for data manipulation and visualization
library(Rcpp) # for running the matching algorithm
library(optmatch) # for matching pairs
library(igraph) # for pair matching via bipartite maximal weighted matching
We load our custom ggplot2
theme for graphs:
We also have to load the script_time_series_matching_function.R
located in the 0.script_matching_algorithm folder and which provides the functions used for matching time series:
First, we load the data:
For each covariate, we create the first daily lags and leads and create a new dataframe called processed_data
:
# create first daily lead for each variable
data_leads <- data %>%
select(date, mean_no2:wind_direction_categories) %>%
mutate_at(vars(-date), ~ lead(., n = 1, order_by = date)) %>%
rename_at(vars(-date), function(x)
paste0(x, "_lead_", 1))
# create first daily lag for each variable
data_lags <- data %>%
select(date, mean_no2:wind_direction_categories) %>%
mutate_at(vars(-date), ~ lag(., n = 1, order_by = date)) %>%
rename_at(vars(-date), function(x)
paste0(x, "_lag_", 1))
# create processed_data
processed_data <- left_join(data, data_lags, by = "date") %>%
left_join(., data_leads, by = "date")
We can now define the hypothetical experiment that we would like to investigate.
We defined our potential experiments such that:
Below are the required steps to select the corresponding treated and control units whose observations are stored in the matching_data:
# construct treatment assignment variable
processed_data <- processed_data %>%
mutate(is_treated = ifelse(wind_direction_categories == "North-East", TRUE, FALSE),
is_treated_lag_1 = lag(is_treated, n = 1, order_by = date))
# remove the days for which assignment is undefined
matching_data = processed_data[!is.na(processed_data$is_treated),]
# susbet treated and control units
treated_units = subset(matching_data, is_treated)
control_units = subset(matching_data,!is_treated)
N_treated = nrow(treated_units)
N_control = nrow(control_units)
There are 912 treated units and 3106 control units. We display the distribution of treated and control units through time:
# make stripes graph
graph_stripes_wd_experiment <- matching_data %>%
mutate(is_treated = ifelse(is_treated == "TRUE", "Treated", "Control")) %>%
ggplot(., aes(x = date, y = 1, fill = is_treated)) +
geom_tile() +
scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
scale_y_continuous(expand = c(0, 0)) +
scale_fill_manual(name = "Daily Observations:", values = c(my_blue, my_orange)) +
xlab("Date") +
theme_tufte() +
theme(
panel.grid.major.y = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank()
)
# display the graph
graph_stripes_wd_experiment
# save the graph
ggsave(
graph_stripes_wd_experiment,
filename = here::here(
"inputs", "3.outputs",
"2.matching_analysis",
"graph_stripes_wd_experiment.pdf"
),
width = 30,
height = 10,
units = "cm",
device = cairo_pdf
)
We save the matching_data
:
Below is the code to define the relevant thresholds:
# we create the scaling list as it is needed for running the algorithm
# but we do not use it
scaling = rep(list(1), ncol(matching_data))
names(scaling) = colnames(matching_data)
# instead, we manually defined the threshold for each covariate
thresholds = rep(list(Inf), ncol(matching_data))
names(thresholds) = colnames(matching_data)
# threshold for julian date
thresholds$julian_date = 60
# threshold for weekend
thresholds$weekend = 0
# threshold for holidays
thresholds$holidays_dummy = 0
# threshold for bank days
thresholds$bank_day_dummy = 0
# thresholds for average temperature
thresholds$temperature_average = 5
# thresholds for average humidity
thresholds$humidity_average = 12
# threshold for wind speed
thresholds$wind_speed = 0.5
# for lag of treatment indicator
thresholds$is_treated_lag_1 = 0
# threshold for rainfall duration
thresholds$rainfall_duration = 0
# thresholds for pm10 in t-1
thresholds$mean_pm10_lag_1 = 8
We compute discrepancy matrix and run the matching algorithm:
# first we compute the discrepancy matrix
discrepancies = discrepancyMatrix(treated_units, control_units, thresholds, scaling)
# convert matching data to data.frame
matching_data <- as.data.frame(matching_data)
rownames(discrepancies) = format(matching_data$date[which(matching_data$is_treated)], "%Y-%m-%d")
colnames(discrepancies) = format(matching_data$date[which(!matching_data$is_treated)], "%Y-%m-%d")
rownames(matching_data) = matching_data$date
# run the fullmatch algorithm
matched_groups = fullmatch(
discrepancies,
data = matching_data,
remove.unmatchables = TRUE,
max.controls = 1
)
# get list of matched treated-control groups
groups_labels = unique(matched_groups[!is.na(matched_groups)])
groups_list = list()
for (i in 1:length(groups_labels)) {
IDs = names(matched_groups)[(matched_groups == groups_labels[i])]
groups_list[[i]] = as.Date(IDs[!is.na(IDs)])
}
For somes cases, several controls units were matched to a treatment unit. We use the igraph
package to force pair matching via bipartite maximal weighted matching. Below is the required code:
# we build a bipartite graph with one layer of treated nodes, and another layer of control nodes.
# the nodes are labeled by integers from 1 to (N_treated + N_control)
# by convention, the first N_treated nodes correspond to the treated units, and the remaining N_control
# nodes correspond to the control units.
# build pseudo-adjacency matrix: edge if and only if match is admissible
# NB: this matrix is rectangular so it is not per say the adjacendy matrix of the graph
# (for this bipartite graph, the adjacency matrix had four blocks: the upper-left block of size
# N_treated by N_treated filled with 0's, bottom-right block of size N_control by N_control filled with 0's,
# top-right block of size N_treated by N_control corresponding to adj defined below, and bottom-left block
# of size N_control by N_treated corresponding to the transpose of adj)
adj = (discrepancies < Inf)
# extract endpoints of edges
edges_mat = which(adj, arr.ind = TRUE)
# build weights, listed in the same order as the edges (we use a decreasing function x --> 1/(1+x) to
# have weights inversely proportional to the discrepancies, since maximum.bipartite.matching
# maximizes the total weight and we want to minimize the discrepancy)
weights = 1 / (1 + sapply(1:nrow(edges_mat), function(i)
discrepancies[edges_mat[i, 1], edges_mat[i, 2]]))
# format list of edges (encoded as a vector resulting from concatenating the end points of each edge)
# i.e c(edge1_endpoint1, edge1_endpoint2, edge2_endpoint1, edge2_endpoint1, edge3_endpoint1, etc...)
edges_mat[, "col"] = edges_mat[, "col"] + N_treated
edges_vector = c(t(edges_mat))
# NB: by convention, the first N_treated nodes correspond to the treated units, and the remaining N_control
# nodes correspond to the control units (hence the "+ N_treated" to shift the labels of the control nodes)
# build the graph from the list of edges
BG = make_bipartite_graph(c(rep(TRUE, N_treated), rep(FALSE, N_control)), edges = edges_vector)
# find the maximal weighted matching
MBM = maximum.bipartite.matching(BG, weights = weights)
# list the dates of the matched pairs
pairs_list = list()
N_matched = 0
for (i in 1:N_treated) {
if (!is.na(MBM$matching[i])) {
N_matched = N_matched + 1
pairs_list[[N_matched]] = c(treated_units$date[i], control_units$date[MBM$matching[i] -
N_treated])
}
}
# transform the list of matched pairs to a dataframe
matched_pairs <- enframe(pairs_list) %>%
unnest(cols = "value") %>%
rename(pair_number = name,
date = value)
The hypothetical experiment we set up had 912 treated units and 3106 control units. The matching procedure results in 169 matched treated units.
One issue with our matching procedure is that matched pairs can be temporarily too close: this would violate the Stable Unit Treatment Value Assumption (STUVA), which states that the potential outcomes of each unit is independent from the potential outcomes from other units. Another way to put it is that there is no interference between treated and control units.
We compute below the temporal distance in days between treated and control units for each pair:
# compute temporal distance within pairs
pair_temporal_distance <- matched_pairs %>%
group_by(pair_number) %>%
summarise(date_difference = abs(date[2]-date[1])) %>%
ungroup()
# summary statistics
pair_temporal_distance %>%
summarise(Mean = mean(date_difference),
SD = sd(date_difference),
Min = min(date_difference),
Max = max(date_difference)) %>%
mutate_all(~ round(., 0)) %>%
kable(., align = c(rep("c", 4)))
Mean | SD | Min | Max |
---|---|---|---|
13 days | 14 | 1 days | 59 days |
There are exactly 48 pairs for which the absolute difference in days is less or equal to 3. We therefore decided to drop these pairs to make the STUVA more credible:
The resulting number of matched pairs is equal to 121.
We finally merge the matched_pairs
with the matching_matching_data
to retrieve covariates values for the matched pairs and save the data:
If you see mistakes or want to suggest changes, please create an issue on the source repository.