Selection-rate weighting (Chi et al. with r_t calibration)

Implements Chi et al.-style selection weights.

Usage

adjust_selection_rate(
  mpd_od_df,
  coverage_df,
  covariates_df = NULL,
  covariate_col = NULL,
  weight_by = c("origin", "destination", "both"),
  r_global = NULL,
  benchmark_od_df = NULL,
  flow_col_bench = "flow",
  r_grid = seq(0, 3, by = 0.01),
  calibration_aggregate = c("origin", "od"),
  clip_min = 0,
  clip_max = Inf,
  keep_cols = character(),
  ...
)

Arguments

mpd_od_df

Data frame with: origin, destination, flow, mpd_source.

coverage_df

Data frame in NEW or LEGACY schema.

covariates_df

Optional data frame with columns:

area
a numeric area-level covariate column
optional mpd_source

covariate_col

Optional name of the numeric area-level covariate column in covariates_df. The selected covariate is normalized to the 0-1 range internally before weights are computed. If NULL, auto-detect among legacy income-like names: income_norm, gni_pc, income, inc, gni, income_2000, income_2010.

weight_by

"origin", "destination", or "both".

r_global

Optional scalar r_t. Overrides calibration if given.

benchmark_od_df

Optional OD benchmark for calibrating r_t. Must contain: origin, destination, and a flow column.

flow_col_bench

Name of benchmark flow column in benchmark_od_df. Default "flow".

r_grid

Numeric vector of candidate r_t values for calibration. Default seq(0, 3, by = 0.01) as in Chi et al.

calibration_aggregate

"origin" (default, Chi et al. style) or "od".

clip_min

Lower bound used to clamp weights. Default 0.

clip_max

Upper bound used to clamp weights. Default Inf.

keep_cols

Extra columns from mpd_od_df to keep.

...

Deprecated arguments. income_col is accepted here as a legacy alias for covariate_col.

Value

Tibble with: origin, destination, mpd_source, flow, weight_origin, weight_destination, weight_missing, flow_adj. Attributes: - "r_global": numeric r_t used. - "r_calibration": data.frame of (r, loss) if calibration was run.

Details

Notation used throughout:

$P_i^{(O)}, P_j^{(D)}$: population at origin $i$ and destination $j$
$U_i^{(O)}, U_j^{(D)}$: active users at origin $i$ and destination $j$
$p_i^{(O)} = U_i^{(O)}/P_i^{(O)}$ and $p_j^{(D)} = U_j^{(D)}/P_j^{(D)}$: penetration
$I_i, I_j \in [0,1]$: normalized selected covariate
$r_t > 0$: global selection parameter
$F_{ij}^{mpd}$ and $F_{ij}^{adj}$: observed and adjusted flows

Origin-side weight: $$w_i^{(O)}(r_t) = \frac{1}{I_i p_i^{(O)} + (1 - I_i) r_t}$$

Destination-side weight: $$w_j^{(D)}(r_t) = \frac{1}{I_j p_j^{(D)} + (1 - I_j) r_t}$$

Flow adjustment: $$F_{ij}^{adj} = F_{ij}^{mpd} \times w_{ij}$$ with $w_{ij} = w_i^{(O)}$ for weight_by = "origin", $w_{ij} = w_j^{(D)}$ for weight_by = "destination", and $w_{ij} = \sqrt{w_i^{(O)} w_j^{(D)}}$ for weight_by = "both".

with flexible area-level covariates and optional calibration of r_t against benchmark OD flows.

Coverage formats: NEW: origin, origin_population, origin_user_count, destination, destination_population, destination_user_count, mpd_source LEGACY: origin, population, user_count, mpd_source

Optional covariates: covariates_df with: area, a numeric area-level covariate column (specified via covariate_col), optional mpd_source.

r_t handling:

If r_global is provided: use it directly.
Else if benchmark_od_df is provided: calibrate r_t by grid search to minimize the sum of absolute errors, reproducing Chi et al.'s idea. By default aggregates by origin: $$r_t^\ast = \arg\min_{r \in \mathcal{R}} \sum_i \left|\hat{M}_i(r) - M_i^{bench}\right|$$
Else: fall back to descriptive r_t = sum(U) / sum(P) (transparent).