Selection-rate weighting (Chi et al. with r_t calibration)
Source:R/adjust_selection_rate.R
adjust_selection_rate.RdImplements Chi et al.-style selection weights.
Usage
adjust_selection_rate(
mpd_od_df,
coverage_df,
covariates_df = NULL,
covariate_col = NULL,
weight_by = c("origin", "destination", "both"),
r_global = NULL,
benchmark_od_df = NULL,
flow_col_bench = "flow",
r_grid = seq(0, 3, by = 0.01),
calibration_aggregate = c("origin", "od"),
clip_min = 0,
clip_max = Inf,
keep_cols = character(),
...
)Arguments
- mpd_od_df
Data frame with: origin, destination, flow, mpd_source.
- coverage_df
Data frame in NEW or LEGACY schema.
- covariates_df
Optional data frame with columns:
area
a numeric area-level covariate column
optional mpd_source
- covariate_col
Optional name of the numeric area-level covariate column in
covariates_df. The selected covariate is normalized to the 0-1 range internally before weights are computed. If NULL, auto-detect among legacy income-like names:income_norm, gni_pc, income, inc, gni, income_2000, income_2010.- weight_by
"origin", "destination", or "both".
- r_global
Optional scalar r_t. Overrides calibration if given.
- benchmark_od_df
Optional OD benchmark for calibrating r_t. Must contain: origin, destination, and a flow column.
- flow_col_bench
Name of benchmark flow column in
benchmark_od_df. Default "flow".- r_grid
Numeric vector of candidate r_t values for calibration. Default
seq(0, 3, by = 0.01)as in Chi et al.- calibration_aggregate
"origin" (default, Chi et al. style) or "od".
- clip_min
Lower bound used to clamp weights. Default 0.
- clip_max
Upper bound used to clamp weights. Default Inf.
- keep_cols
Extra columns from
mpd_od_dfto keep.- ...
Deprecated arguments.
income_colis accepted here as a legacy alias forcovariate_col.
Value
Tibble with: origin, destination, mpd_source, flow, weight_origin, weight_destination, weight_missing, flow_adj. Attributes: - "r_global": numeric r_t used. - "r_calibration": data.frame of (r, loss) if calibration was run.
Details
Notation used throughout:
\(P_i^{(O)}, P_j^{(D)}\): population at origin \(i\) and destination \(j\)
\(U_i^{(O)}, U_j^{(D)}\): active users at origin \(i\) and destination \(j\)
\(p_i^{(O)} = U_i^{(O)}/P_i^{(O)}\) and \(p_j^{(D)} = U_j^{(D)}/P_j^{(D)}\): penetration
\(I_i, I_j \in [0,1]\): normalized selected covariate
\(r_t > 0\): global selection parameter
\(F_{ij}^{mpd}\) and \(F_{ij}^{adj}\): observed and adjusted flows
Origin-side weight: $$w_i^{(O)}(r_t) = \frac{1}{I_i p_i^{(O)} + (1 - I_i) r_t}$$
Destination-side weight: $$w_j^{(D)}(r_t) = \frac{1}{I_j p_j^{(D)} + (1 - I_j) r_t}$$
Flow adjustment:
$$F_{ij}^{adj} = F_{ij}^{mpd} \times w_{ij}$$
with \(w_{ij} = w_i^{(O)}\) for weight_by = "origin",
\(w_{ij} = w_j^{(D)}\) for weight_by = "destination", and
\(w_{ij} = \sqrt{w_i^{(O)} w_j^{(D)}}\) for weight_by = "both".
with flexible area-level covariates and optional calibration of r_t against benchmark OD flows.
Coverage formats: NEW: origin, origin_population, origin_user_count, destination, destination_population, destination_user_count, mpd_source LEGACY: origin, population, user_count, mpd_source
Optional covariates:
covariates_df with:
area,
a numeric area-level covariate column (specified via covariate_col),
optional mpd_source.
r_t handling:
If
r_globalis provided: use it directly.Else if
benchmark_od_dfis provided: calibrate r_t by grid search to minimize the sum of absolute errors, reproducing Chi et al.'s idea. By default aggregates by origin: $$r_t^\ast = \arg\min_{r \in \mathcal{R}} \sum_i \left|\hat{M}_i(r) - M_i^{bench}\right|$$Else: fall back to descriptive r_t = sum(U) / sum(P) (transparent).