Notation used throughout:

  • \(P_i^{(O)}, P_j^{(D)}\): population at origin \(i\) and destination \(j\)

  • \(U_i^{(O)}, U_j^{(D)}\): active users at origin \(i\) and destination \(j\)

  • \(p_i^{(O)} = U_i^{(O)}/P_i^{(O)}\) and \(p_j^{(D)} = U_j^{(D)}/P_j^{(D)}\): penetration

  • \(F_{ij}^{mpd}\) and \(F_{ij}^{adj}\): observed and adjusted flows

  • \(k > 0\): selection-rate curvature parameter

Usage

adjust_selection_rate2(
  mpd_od_df,
  coverage_df,
  weight_by = c("origin", "destination", "both"),
  group_cols = NULL,
  k = NULL,
  k_grid = seq(0.1, 5, by = 0.1),
  benchmark_od_df = NULL,
  flow_col_bench = "flow",
  calibration_aggregate = c("origin", "od"),
  clip_min = 0,
  clip_max = Inf,
  keep_cols = character()
)

Arguments

mpd_od_df

Data frame with at least the columns origin, destination, flow, and mpd_source, plus the columns named in group_cols, if used.

coverage_df

Data frame in one of two layouts:

  • New: origin, origin_population, origin_user_count, destination, destination_population, destination_user_count, mpd_source

  • Legacy: origin, population, user_count, mpd_source

Must also contain the columns named in group_cols, if used.

weight_by

"origin", "destination", or "both".

group_cols

Optional character vector of stratification variables present in both mpd_od_df and coverage_df (e.g. c("age_group","sex")).

k

Optional positive scalar. If NULL and benchmark_od_df is supplied, k is calibrated by grid search over k_grid. If NULL and no benchmark is supplied, k defaults to 1.

k_grid

Grid of candidate k values for calibration when k is NULL. Default seq(0.1, 5, by = 0.1).

benchmark_od_df

Optional benchmark OD data frame used to calibrate k. Must contain origin, destination, and the flow column named by flow_col_bench.

flow_col_bench

Name of benchmark flow column. Default "flow".

calibration_aggregate

"origin" (default, compare origin totals) or "od" (compare OD flows directly).

clip_min

Lower bound used to clamp weights. Default 0.

clip_max

Upper bound used to clamp weights. Default Inf.

keep_cols

Extra columns from mpd_od_df to retain.

Value

Tibble with columns: origin, destination, mpd_source, (group_cols), flow, weight_origin, weight_destination, weight_missing, flow_adj.

Attributes:

  • "k": numeric k used.

  • "k_calibration": data.frame of k vs. loss (if calibrated).

Details

Implements the "Selection Rate II" correction: $$CF(p; k) = p \frac{e^{-k} - 1}{e^{-k p} - 1}$$

where \(p\) is a penetration rate between 0 and 1 (e.g. Internet or platform penetration), and \(k > 0\) controls how strongly selection bias increases as \(p\) decreases.
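The correction factor can be sketched in a few lines of base R. This is a minimal illustration of the formula above, not the package's internal code; the helper name cf_selection_rate2 is hypothetical. It uses expm1(), which computes \(e^x - 1\) accurately for small \(x\). From the formula, \(CF(1; k) = 1\), and as \(p \to 0\) the factor tends to \((1 - e^{-k})/k\).

```r
# Hypothetical helper illustrating CF(p; k); not exported by the package.
cf_selection_rate2 <- function(p, k) {
  stopifnot(is.numeric(p), all(p > 0 & p <= 1))
  stopifnot(is.numeric(k), length(k) == 1, k > 0)
  # p * (exp(-k) - 1) / (exp(-k * p) - 1), written with expm1() for stability
  p * expm1(-k) / expm1(-k * p)
}

# Correction factors for low, medium, and full penetration at k = 1
cf_selection_rate2(c(0.1, 0.5, 1), k = 1)
```

With k = 1, the factor at p = 0.5 is about 0.803, and it equals 1 at full penetration.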

For OD flows:

  • weight_by = "origin": \(w_{ij} = CF(p_i^{(O)}; k)\)

  • weight_by = "destination": \(w_{ij} = CF(p_j^{(D)}; k)\)

  • weight_by = "both": \(w_{ij} = \sqrt{CF(p_i^{(O)}; k)\,CF(p_j^{(D)}; k)}\)

Adjusted flows are: $$F_{ij}^{adj} = F_{ij}^{mpd} \times w_{ij}$$
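As a rough illustration of the three weighting modes, the sketch below computes the weights and the adjusted flows on a toy OD table. The data values, the column names p_origin and p_destination, and the local helper cf are assumptions made for this example only.

```r
# Local copy of CF(p; k) so the example is self-contained (hypothetical helper)
cf <- function(p, k) p * expm1(-k) / expm1(-k * p)

# Toy OD table with penetration already attached (illustrative values)
od <- data.frame(
  origin        = c("A", "A", "B"),
  destination   = c("B", "C", "C"),
  flow          = c(100, 50, 80),
  p_origin      = c(0.4, 0.4, 0.7),
  p_destination = c(0.7, 0.2, 0.2)
)
k <- 1

# weight_by = "origin" and weight_by = "destination"
od$weight_origin      <- cf(od$p_origin, k)
od$weight_destination <- cf(od$p_destination, k)

# weight_by = "both": geometric mean of the two correction factors
od$weight_both <- sqrt(od$weight_origin * od$weight_destination)

# Adjusted flow under weight_by = "both"
od$flow_adj <- od$flow * od$weight_both
od
```

Because the geometric mean lies between its two inputs, the "both" weight always falls between the origin-only and destination-only weights.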

Supports:

  1. Location-only OD (default).

  2. Stratified OD (e.g. age/sex) via group_cols, in which case \(p\) and \(CF\) are computed separately for each (area, group_cols) combination.
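The stratified case can be sketched as follows, assuming the legacy coverage layout (origin, population, user_count) and a single stratification column sex. The toy values and the helper cf are assumptions for illustration; the function's own join logic may differ.

```r
# Local copy of CF(p; k) so the example is self-contained (hypothetical helper)
cf <- function(p, k) p * expm1(-k) / expm1(-k * p)

# Coverage per (area, group): legacy layout plus one group column
coverage <- data.frame(
  origin     = c("A", "A", "B", "B"),
  sex        = c("f", "m", "f", "m"),
  population = c(1000, 1000, 500, 500),
  user_count = c(400, 200, 300, 100)
)
coverage$p <- coverage$user_count / coverage$population

# Stratified OD flows (illustrative values)
od <- data.frame(
  origin      = c("A", "A", "B", "B"),
  destination = c("B", "B", "A", "A"),
  sex         = c("f", "m", "f", "m"),
  flow        = c(40, 10, 30, 5)
)

# Attach penetration by (origin, sex), then weight each stratum separately
od <- merge(od, coverage[, c("origin", "sex", "p")],
            by = c("origin", "sex"), all.x = TRUE)
od$weight_origin <- cf(od$p, k = 1)
od$flow_adj      <- od$flow * od$weight_origin
od
```

Each stratum gets its own weight here: within area A, the group with penetration 0.2 is weighted differently from the group with penetration 0.4.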

Calibration of k:

  • If k is provided (numeric scalar): use it.

  • Else if benchmark_od_df is provided: search k_grid and pick the \(k^*\) that minimises the sum of absolute errors between adjusted and benchmark flows (compared as origin totals or at OD level, depending on calibration_aggregate).

  • Else: default k = 1.
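The grid search described above can be sketched in base R as follows, for origin weighting with origin-aggregated comparison. This is an illustration under stated assumptions rather than the function's actual implementation: the toy data, the helper cf, and the synthetic benchmark (built from a known k_true so the search has a clear answer) are all made up for the example.

```r
# Local copy of CF(p; k) so the example is self-contained (hypothetical helper)
cf <- function(p, k) p * expm1(-k) / expm1(-k * p)

# Toy observed flows with origin penetration attached (illustrative values)
mpd <- data.frame(
  origin      = c("A", "A", "B", "B"),
  destination = c("B", "C", "A", "C"),
  flow        = c(120, 60, 90, 30),
  p_origin    = c(0.3, 0.3, 0.6, 0.6)
)

# Synthetic benchmark generated from a known k
k_true <- 2
bench <- data.frame(
  origin      = mpd$origin,
  destination = mpd$destination,
  flow        = mpd$flow * cf(mpd$p_origin, k_true)
)

# Sum of absolute errors between adjusted and benchmark origin totals
loss_origin <- function(k) {
  adj_tot   <- tapply(mpd$flow * cf(mpd$p_origin, k), mpd$origin, sum)
  bench_tot <- tapply(bench$flow, bench$origin, sum)
  sum(abs(adj_tot - bench_tot[names(adj_tot)]))
}

# Evaluate the loss on the default grid and keep the minimiser
k_grid <- seq(0.1, 5, by = 0.1)
k_calibration <- data.frame(
  k    = k_grid,
  loss = vapply(k_grid, loss_origin, numeric(1))
)
k_star <- k_calibration$k[which.min(k_calibration$loss)]
k_star
```

Since the benchmark was generated with k_true = 2 and that value lies on the grid, the search recovers it. With real benchmark data the minimum loss will generally be non-zero, and the k_calibration table shows how flat or sharp the loss is around \(k^*\).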