Selection Rate II weighting (Zagheni & Weber 2012) with k calibration
Source: R/adjust_selection_rate2.R
Notation used throughout:
\(P_i^{(O)}, P_j^{(D)}\): population at origin \(i\) and destination \(j\)
\(U_i^{(O)}, U_j^{(D)}\): active users at origin \(i\) and destination \(j\)
\(p_i^{(O)} = U_i^{(O)}/P_i^{(O)}\) and \(p_j^{(D)} = U_j^{(D)}/P_j^{(D)}\): penetration
\(F_{ij}^{mpd}\) and \(F_{ij}^{adj}\): observed and adjusted flows
\(k > 0\): selection-rate curvature parameter
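As a small illustration of the notation, penetration is the ratio of active users to population. The sketch below uses made-up numbers and the origin-side column names from the coverage_df description (origin_population, origin_user_count); the p_origin column name is an assumption introduced only for this example.

```r
# Toy coverage table; all values are invented for illustration.
coverage <- data.frame(
  origin            = c("A", "B", "C"),
  origin_population = c(1000, 5000, 200),
  origin_user_count = c(250, 500, 150)
)

# Penetration p_i^(O) = U_i^(O) / P_i^(O).
# `p_origin` is a hypothetical column name, not one the package is known to use.
coverage$p_origin <- coverage$origin_user_count / coverage$origin_population

# A penetration rate should lie in (0, 1].
stopifnot(all(coverage$p_origin > 0 & coverage$p_origin <= 1))

print(coverage)
```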
Usage
adjust_selection_rate2(
  mpd_od_df,
  coverage_df,
  weight_by = c("origin", "destination", "both"),
  group_cols = NULL,
  k = NULL,
  k_grid = seq(0.1, 5, by = 0.1),
  benchmark_od_df = NULL,
  flow_col_bench = "flow",
  calibration_aggregate = c("origin", "od"),
  clip_min = 0,
  clip_max = Inf,
  keep_cols = character()
)
Arguments
- mpd_od_df
Data frame with at least: origin, destination, flow, mpd_source; plus group_cols if used.
- coverage_df
Data frame in one of two layouts. New layout: origin, origin_population, origin_user_count, destination, destination_population, destination_user_count, mpd_source. Legacy layout: origin, population, user_count, mpd_source. In either layout it must also contain group_cols if used.
- weight_by
"origin", "destination", or "both".
- group_cols
Optional character vector of stratification variables present in both mpd_od_df and coverage_df (e.g. c("age_group","sex")).
- k
Optional positive scalar. If NULL and benchmark_od_df is supplied, k is calibrated by grid search over k_grid. If NULL and no benchmark is supplied, k defaults to 1.
- k_grid
Grid of candidate k values for calibration when k is NULL. Default seq(0.1, 5, by = 0.1).
- benchmark_od_df
Optional benchmark OD for calibrating k. Must contain origin, destination, and a flow column.
- flow_col_bench
Name of benchmark flow column. Default "flow".
- calibration_aggregate
"origin" (default, compare origin totals) or "od" (compare OD flows directly).
- clip_min
Lower bound used to clamp weights. Default 0.
- clip_max
Upper bound used to clamp weights. Default Inf.
- keep_cols
Extra columns from mpd_od_df to retain.
Value
Tibble with columns: origin, destination, mpd_source, (group_cols), flow, weight_origin, weight_destination, weight_missing, flow_adj.
Attributes:
- "k": numeric k used.
- "k_calibration": data.frame of k vs loss (if calibrated).
Details
Implements the "Selection Rate II" correction: $$CF(p; k) = p \frac{e^{-k} - 1}{e^{-k p} - 1}$$
where \(p\) is a penetration rate between 0 and 1 (e.g. Internet or platform penetration), and k > 0 controls how strongly selection bias increases as p decreases.
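As a quick check on the formula, its limiting cases follow from \(e^{x} - 1 \approx x\) for small \(x\). This is a derivation sketch from the stated formula only, not a statement taken from the package source: $$CF(1; k) = 1, \qquad \lim_{p \to 0^{+}} CF(p; k) = \frac{1 - e^{-k}}{k}, \qquad \lim_{k \to 0^{+}} CF(p; k) = 1$$
For example, with \(k = 1\) and \(p = 0.5\), \(CF = 0.5\,(1 + e^{-0.5}) \approx 0.803\).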
For OD flows:
weight_by = "origin": \(w_{ij} = CF(p_i^{(O)}; k)\)
weight_by = "destination": \(w_{ij} = CF(p_j^{(D)}; k)\)
weight_by = "both": \(w_{ij} = \sqrt{CF(p_i^{(O)}; k)\,CF(p_j^{(D)}; k)}\)
Adjusted flows are: $$F_{ij}^{adj} = F_{ij}^{mpd} \times w_{ij}$$
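The weighting formulas above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's internal code: the helper cf(), the toy OD table, and the p_origin / p_destination columns are all hypothetical, and the penetration values are invented.

```r
# Correction factor CF(p; k) = p * (exp(-k) - 1) / (exp(-k * p) - 1).
# expm1(x) computes exp(x) - 1 accurately for small x.
# Assumes 0 < p <= 1 and k > 0 (p = 0 would give 0 / 0).
cf <- function(p, k) {
  stopifnot(all(p > 0 & p <= 1), k > 0)
  p * expm1(-k) / expm1(-k * p)
}

# Hypothetical OD table with penetration already joined on.
od <- data.frame(
  origin        = c("A", "A", "B"),
  destination   = c("B", "C", "C"),
  flow          = c(100, 40, 60),
  p_origin      = c(0.5, 0.5, 0.2),
  p_destination = c(0.2, 1.0, 1.0)
)
k <- 1

# The three weight_by variants.
w_origin      <- cf(od$p_origin, k)            # weight_by = "origin"
w_destination <- cf(od$p_destination, k)       # weight_by = "destination"
w_both        <- sqrt(w_origin * w_destination) # weight_by = "both"

# Adjusted flow F_adj = F_mpd * w, here using the "both" weights.
od$flow_adj <- od$flow * w_both

print(od)
```

With full penetration (p = 1) the factor equals 1, so the destination-only weight leaves flows into "C" unchanged in this toy table.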
Supports:
Location-only OD (default).
Stratified OD (e.g. age/sex) via group_cols; then p and CF are computed by (area, group_cols).
Calibration of k:
If k is provided (numeric scalar): use it.
Else if benchmark_od_df is provided: search k_grid, pick k* minimising sum of absolute errors between adjusted and benchmark flows (origin-aggregated or OD-level).
Else: default k = 1.
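The grid-search branch can be sketched as follows. This is a rough illustration assuming weight_by = "origin" and calibration_aggregate = "origin"; the cf() helper, the toy mpd and bench tables, and the p_origin column are hypothetical, and the package's actual implementation may differ.

```r
# Correction factor from the Details section; assumes 0 < p <= 1 and k > 0.
cf <- function(p, k) p * expm1(-k) / expm1(-k * p)

# Hypothetical observed flows with origin penetration attached.
mpd <- data.frame(
  origin      = c("A", "A", "B", "B"),
  destination = c("B", "C", "A", "C"),
  flow        = c(100, 50, 80, 20),
  p_origin    = c(0.5, 0.5, 0.2, 0.2)
)

# Hypothetical benchmark origin totals.
bench <- data.frame(origin = c("A", "B"), flow = c(125, 75))

k_grid <- seq(0.1, 5, by = 0.1)

# Loss for one candidate k: sum of absolute errors between
# adjusted and benchmark origin totals.
loss_for_k <- function(k) {
  adj     <- mpd$flow * cf(mpd$p_origin, k)
  adj_tot <- tapply(adj, mpd$origin, sum)
  sum(abs(adj_tot[as.character(bench$origin)] - bench$flow))
}

# Table of k vs loss, analogous to the "k_calibration" attribute.
k_calibration <- data.frame(
  k    = k_grid,
  loss = vapply(k_grid, loss_for_k, numeric(1))
)

# k* is the grid value with the smallest loss.
k_star <- k_calibration$k[which.min(k_calibration$loss)]
print(k_star)
```

Because the search is restricted to k_grid, the chosen k is only as fine as the grid spacing; a narrower grid around the first optimum is one way to refine it.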