Selection Rate II weighting (Zagheni & Weber 2012) with k calibration

Notation used throughout:

$P_i^{(O)}, P_j^{(D)}$: population at origin $i$ and destination $j$
$U_i^{(O)}, U_j^{(D)}$: active users at origin $i$ and destination $j$
$p_i^{(O)} = U_i^{(O)}/P_i^{(O)}$ and $p_j^{(D)} = U_j^{(D)}/P_j^{(D)}$: penetration
$F_{ij}^{mpd}$ and $F_{ij}^{adj}$: observed and adjusted flows
$k > 0$: selection-rate curvature parameter

Usage

adjust_selection_rate2(
  mpd_od_df,
  coverage_df,
  weight_by = c("origin", "destination", "both"),
  group_cols = NULL,
  k = NULL,
  k_grid = seq(0.1, 5, by = 0.1),
  benchmark_od_df = NULL,
  flow_col_bench = "flow",
  calibration_aggregate = c("origin", "od"),
  clip_min = 0,
  clip_max = Inf,
  keep_cols = character()
)

Arguments

mpd_od_df: Data frame with at least: origin, destination, flow, mpd_source; plus group_cols if used.
coverage_df: Data frame of coverage counts in one of two layouts. New layout: origin, origin_population, origin_user_count, destination, destination_population, destination_user_count, mpd_source. Legacy layout: origin, population, user_count, mpd_source. Must also contain group_cols if used.
weight_by: "origin", "destination", or "both".
group_cols: Optional character vector of stratification variables present in both mpd_od_df and coverage_df (e.g. c("age_group","sex")).
k: Optional positive scalar. If NULL and benchmark_od_df is supplied, k is calibrated by grid search over k_grid. If NULL and no benchmark is supplied, k defaults to 1.
k_grid: Grid of candidate k values for calibration when k is NULL. Default seq(0.1, 5, by = 0.1).
benchmark_od_df: Optional benchmark OD data frame used to calibrate k. Must contain origin, destination, and the flow column named by flow_col_bench.
flow_col_bench: Name of benchmark flow column. Default "flow".
calibration_aggregate: "origin" (default, compare origin totals) or "od" (compare OD flows directly).
clip_min: Lower bound used to clamp weights. Default 0.
clip_max: Upper bound used to clamp weights. Default Inf.
keep_cols: Extra columns from mpd_od_df to retain.

Value

Tibble with columns: origin, destination, mpd_source, (group_cols), flow, weight_origin, weight_destination, weight_missing, flow_adj.

Attributes:

"k": numeric k used.
"k_calibration": data.frame of k vs. loss (if calibrated).

Details

Implements the "Selection Rate II" correction: $$CF(p; k) = p \frac{e^{-k} - 1}{e^{-k p} - 1}$$

where $p$ is a penetration rate between 0 and 1 (e.g. Internet or platform penetration), and $k > 0$ controls how strongly selection bias increases as $p$ decreases.
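To make the shape of the correction concrete, the formula above can be evaluated directly. This is a minimal sketch, not the package's internal code: the helper name `cf_selection_rate2` is hypothetical, and the use of `expm1()` is a numerical-stability choice made here rather than something the documentation specifies.

```r
# Selection Rate II correction factor:
#   CF(p; k) = p * (exp(-k) - 1) / (exp(-k * p) - 1)
# `cf_selection_rate2` is a hypothetical helper name used for illustration only.
cf_selection_rate2 <- function(p, k = 1) {
  stopifnot(is.numeric(p), all(p > 0 & p <= 1),
            is.numeric(k), length(k) == 1, k > 0)
  # expm1(x) computes exp(x) - 1 accurately when x is close to 0
  p * expm1(-k) / expm1(-k * p)
}

p <- c(0.1, 0.5, 1)
cf_k1 <- cf_selection_rate2(p, k = 1)  # mild curvature
cf_k3 <- cf_selection_rate2(p, k = 3)  # stronger curvature
```

As written, the factor equals 1 at $p = 1$ and tends to $(1 - e^{-k})/k$ as $p \to 0$, so a larger $k$ widens the gap between low-penetration and high-penetration areas.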

For OD flows:

weight_by = "origin": $w_{ij} = CF(p_i^{(O)}; k)$
weight_by = "destination": $w_{ij} = CF(p_j^{(D)}; k)$
weight_by = "both": $w_{ij} = \sqrt{CF(p_i^{(O)}; k)\,CF(p_j^{(D)}; k)}$

Adjusted flows are: $$F_{ij}^{adj} = F_{ij}^{mpd} \times w_{ij}$$
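The three weighting modes and the flow adjustment can be sketched in a few lines of base R. The helper names `cf` and `od_weight`, the penetration columns `p_origin` and `p_dest`, and the toy numbers are all assumptions made for illustration; clamping with `clip_min` and `clip_max` is shown as a final step, which is one plausible reading of those arguments rather than confirmed behaviour.

```r
# Hypothetical helper implementing CF(p; k) from the formula above.
cf <- function(p, k) p * expm1(-k) / expm1(-k * p)

# Hypothetical helper: OD weight for each of the three weight_by modes.
od_weight <- function(p_origin, p_dest, k = 1,
                      weight_by = c("origin", "destination", "both"),
                      clip_min = 0, clip_max = Inf) {
  weight_by <- match.arg(weight_by)
  w <- switch(weight_by,
    origin      = cf(p_origin, k),
    destination = cf(p_dest, k),
    both        = sqrt(cf(p_origin, k) * cf(p_dest, k))  # geometric mean
  )
  # Assumption: weights are clamped to [clip_min, clip_max] after computation.
  pmin(pmax(w, clip_min), clip_max)
}

# Toy OD table with made-up flows and penetration rates.
od <- data.frame(
  origin      = c("A", "A", "B"),
  destination = c("B", "C", "C"),
  flow        = c(100, 40, 60),
  p_origin    = c(0.2, 0.2, 0.5),
  p_dest      = c(0.5, 0.8, 0.8)
)

od$weight   <- od_weight(od$p_origin, od$p_dest, k = 1, weight_by = "both")
od$flow_adj <- od$flow * od$weight  # F_adj = F_mpd * w
```

Using the geometric mean for "both" keeps the combined weight on the same scale as the single-sided weights, so switching modes does not square the correction.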

Supports:

Location-only OD (default).
Stratified OD (e.g. age/sex) via group_cols, in which case $p$ and $CF$ are computed separately for each (area, group_cols) combination.
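For the stratified case, the idea is that penetration is computed per area and stratum, then joined back onto the flows on the same keys. The sketch below uses the legacy coverage layout with a single grouping column `age_group` and origin-side weighting; the data values are invented, and the join via `merge()` is an illustrative choice rather than the package's actual implementation.

```r
# Made-up coverage counts by origin area and age group (legacy layout).
coverage <- data.frame(
  origin     = c("A", "A", "B", "B"),
  age_group  = c("15-39", "40+", "15-39", "40+"),
  population = c(1000, 1500, 800, 1200),
  user_count = c(400, 300, 240, 120)
)

k <- 1
# Penetration and correction factor per (area, group) cell.
coverage$p  <- coverage$user_count / coverage$population
coverage$cf <- coverage$p * expm1(-k) / expm1(-k * coverage$p)

# Made-up stratified OD flows.
od <- data.frame(
  origin      = c("A", "A", "B"),
  destination = c("B", "B", "A"),
  age_group   = c("15-39", "40+", "15-39"),
  flow        = c(50, 20, 30)
)

# Join the stratum-specific factor on (origin, age_group), then adjust.
od <- merge(od, coverage[, c("origin", "age_group", "cf")],
            by = c("origin", "age_group"))
od$flow_adj <- od$flow * od$cf
```

Each flow is scaled by the factor of its own stratum, so two age groups leaving the same origin can receive different weights.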

Calibration of k:

If k is provided (numeric scalar): use it.
Else if benchmark_od_df is provided: search k_grid and pick the k* that minimises the sum of absolute errors between adjusted and benchmark flows (compared as origin totals or at OD level, per calibration_aggregate).
Else: default k = 1.
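The grid search described above can be sketched as follows. This is an assumption-laden illustration: `calibrate_k` and `cf` are hypothetical helpers, only origin-side weighting is shown, the benchmark flow column is named `flow_bench` here to avoid a name clash on merge, and the toy benchmark is synthesised from a known k so the search has a clear answer.

```r
# Hypothetical helper implementing CF(p; k) from the Details formula.
cf <- function(p, k) p * expm1(-k) / expm1(-k * p)

# Hypothetical grid search: choose the k minimising the sum of absolute errors.
calibrate_k <- function(od, bench, k_grid = seq(0.1, 5, by = 0.1),
                        aggregate = c("origin", "od")) {
  aggregate <- match.arg(aggregate)
  m <- merge(od, bench, by = c("origin", "destination"))
  loss <- vapply(k_grid, function(k) {
    adj <- m$flow * cf(m$p_origin, k)  # origin-weighted adjusted flows
    if (aggregate == "origin") {
      # Compare totals per origin
      sum(abs(tapply(adj, m$origin, sum) - tapply(m$flow_bench, m$origin, sum)))
    } else {
      # Compare each OD pair directly
      sum(abs(adj - m$flow_bench))
    }
  }, numeric(1))
  list(k = k_grid[which.min(loss)],
       k_calibration = data.frame(k = k_grid, loss = loss))
}

# Toy observed flows with made-up origin penetration rates.
od <- data.frame(
  origin      = c("A", "A", "B", "C"),
  destination = c("B", "C", "C", "A"),
  flow        = c(100, 40, 60, 80),
  p_origin    = c(0.2, 0.2, 0.5, 0.7)
)
# Synthetic benchmark generated with k = 2, so the search should recover it.
bench <- data.frame(
  origin      = od$origin,
  destination = od$destination,
  flow_bench  = od$flow * cf(od$p_origin, 2)
)

res_origin <- calibrate_k(od, bench, aggregate = "origin")
res_od     <- calibrate_k(od, bench, aggregate = "od")
```

The returned list mirrors the "k" and "k_calibration" attributes described under Value. Because the search is restricted to `k_grid`, the selected value can be no finer than the grid step.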