Bayesian Multilevel Bias Adjustment for OD Flows (v0.2 Stage 1)
Source:R/adjust_multilevel_bayes.R
adjust_multilevel_bayes.RdEstimates bias-adjusted OD flows from MPD counts using a flexible Bayesian multilevel count model. The observed-OD mode corrects observed MPD flows. The complete-grid mode fits to originally observed source rows and predicts adjusted flows for every row in a supplied square OD grid.
Usage
adjust_multilevel_bayes(
mpd_od_df,
coverage_df,
covariates_df = NULL,
distance_df = NULL,
flow_col = "flow",
income_col = NULL,
pop_col = "population",
distance_col = "distance_km",
source_col = NULL,
time_col = NULL,
scenario = c("auto", "s1", "s2", "s3", "s4"),
repeated_observation = c("auto", "none", "time", "source", "source_time"),
random_intercept = c("origin", "destination", "od", "source", "time", "source_time",
"none"),
formula = NULL,
custom_formula = NULL,
mobility_formula = NULL,
bias_formula = NULL,
model_family = c("poisson", "negbin", "zip", "zinb"),
model_engine = c("bayesian", "frequentist"),
backend = c("auto", "rstanarm", "brms", "stan_latent"),
flow_adj_summary = c("mean", "median"),
target_scale = c("mpd_counterfactual", "true_flow"),
observation_model = c("reduced_form", "coverage_offset", "latent_two_level"),
coverage_scale = c("origin", "destination", "both"),
latent_flow_unit = c("auto", "od", "od_time"),
latent_coef_prior_scale = 1,
latent_bias_prior_scale = 0.5,
latent_intercept_prior_scale = 2.5,
latent_state_prior_scale = 0.75,
latent_source_prior_scale = 0.5,
latent_time_prior_scale = 0.5,
latent_phi_prior_rate = 1,
latent_adapt_delta = 0.95,
latent_max_treedepth = 12,
latent_rng_eta_max = 20,
prediction_scope = c("observed", "complete_grid"),
iter = 1000,
chains = 2,
seed = 123,
refresh = 0,
include_flow_adj_draws = FALSE,
keep_cols = character()
)Arguments
- mpd_od_df
Data frame with at least
origin,destination, andflow_col. Optionalmpd_sourceis carried through.- coverage_df
Data frame with at least
origin,population,user_count. Optionaldestinationandmpd_sourceenable destination/source-specific joins.- covariates_df
Optional area-level covariates with at least
area. Each additional column is made available toformulatwice using origin and destination suffixes, for examplerural_pct_oandrural_pct_d. Ifpop_colexists, it is used for population covariates; otherwise populations are derived fromcoverage_df. Set toNULLwhenformulauses only OD, distance, coverage, source, or time fields.- distance_df
Optional OD distance data frame with
origin,destination, anddistance_col.- flow_col
Name of MPD flow column in
mpd_od_df. Default"flow".- income_col
Deprecated. Optional name of the area-level covariate to use in the default formula when
formula = NULL. Prefer specifying covariates directly informula.- pop_col
Optional population column in
covariates_df. Default"population". If missing, a population proxy is built fromcoverage_df.- distance_col
Name of distance column in
distance_df(or inmpd_od_dfifdistance_df = NULL). Default"distance_km".- source_col
Optional mobile-phone data source column in
mpd_od_df. IfNULLandmpd_sourceexists, that column is used. Source metadata supports S3 and S4 inputs.- time_col
Optional observation-period column in
mpd_od_df. Time metadata supports S2 and S4 inputs.- scenario
Input scenario contract:
"auto","s1","s2","s3", or"s4". S1 is single source/single time, S2 is single source/multiple times, S3 is multiple sources/single time, and S4 is multiple sources/multiple times."auto"infers the scenario from the supplied source and time columns.- repeated_observation
Repeated-observation structure:
"auto","none","time","source", or"source_time"."auto"maps fromscenario.- random_intercept
Random-intercept structure:
"origin","destination","od","source","time","source_time", or"none". Used only whenformulais not supplied; formula random-effect terms such as(1 | origin)or(1 + log_distance | origin)take precedence.- formula
Optional model formula using the prepared model data. The response should usually be
flow. The formula can include any prepared covariate columns,log_distance,bias_e_origin, scenario columns such asmpd_sourceandmpd_time, and random-effect terms supported by the chosen engine.- custom_formula
Deprecated alias for
formula.- mobility_formula
Optional one-sided formula for the conceptual true-flow component, for example
~ rural_pct_o + rural_pct_d + log_distance + (1 | origin). Supply together withbias_formula; do not combine withformulaorcustom_formula.- bias_formula
Optional one-sided formula for the conceptual MPD observation-bias component, for example
~ bias_e_origin. Supply together withmobility_formula. Fixed-effect variables in this formula must be numeric, integer, or logical because the adjusted counterfactual sets them to zero.- model_family
Count family:
"poisson","negbin","zip", or"zinb".- model_engine
Fitting engine.
"bayesian"uses the Bayesian backend for S1-S4 source/time scenarios;"frequentist"uses a faster Poisson/negative-binomial GLM/GLMM scaffold for testing, experimentation, and runtime-sensitive comparisons.- backend
Bayesian backend:
"auto","rstanarm","brms", or"stan_latent"."auto"chooses rstanarm for Poisson/NegBin reduced-form and coverage-offset models, brms for zero-inflated families, and the package's custom Stan backend forobservation_model = "latent_two_level".- flow_adj_summary
Summary for posterior draw-level adjusted flows:
"mean"or"median". This controls how the draw-level adjusted flows are collapsed into the returnedflow_adjcolumn.- target_scale
Adjustment target.
"mpd_counterfactual"preserves the reduced-form MPD-scale counterfactual."true_flow"estimates a true-flow scale by using coverage as an observation-process offset.- observation_model
Observation model.
"reduced_form"preserves the existing fitted-bias-covariate path."coverage_offset"uses active-user coverage as a fixed offset and is required fortarget_scale = "true_flow"."latent_two_level"fits an experimental joint latent model with a source-invariant true-flow state and source/time-specific MPD observation process.- coverage_scale
Coverage rate used by
"coverage_offset":"origin"uses \(c_i\),"destination"uses \(c_j\), and"both"uses \(\sqrt{c_i c_j}\).- latent_flow_unit
Latent state used when
observation_model = "latent_two_level"."od"shares one latent flow across source/time observations of an OD pair;"od_time"shares one latent flow across sources within each OD-time cell."auto"chooses"od_time"for S2/S4 and"od"otherwise.- latent_coef_prior_scale, latent_bias_prior_scale
Positive prior scales used by the custom
"stan_latent"backend for true-flow and observation-bias coefficients. These controls are ignored by"rstanarm"and"brms"backends.- latent_intercept_prior_scale
Positive prior scale for the true-flow intercept in the custom
"stan_latent"backend.- latent_state_prior_scale, latent_source_prior_scale, latent_time_prior_scale
Positive half-normal prior scales used by the custom
"stan_latent"backend for latent OD/OD-time, source, and time standard deviations.- latent_phi_prior_rate
Positive exponential prior rate for the negative-binomial overdispersion parameter in the custom
"stan_latent"backend.- latent_adapt_delta
Target acceptance probability passed to
rstan::sampling()for the custom"stan_latent"backend.- latent_max_treedepth
Maximum tree depth passed to
rstan::sampling()for the custom"stan_latent"backend.- latent_rng_eta_max
Upper clamp for generated-quantity log-rate exponentiation and RNG calls in the custom
"stan_latent"backend. The likelihood and log-likelihood are not clamped; this prevents overflow in returned posterior prediction draws and replicated-count RNGs.- prediction_scope
Prediction contract.
"observed"preserves the original observed-row workflow."complete_grid"requires a strict square OD grid and predicts for all valid rows, while fitting on rows marked as originally observed whenmpd_observedis present.- iter
Number of sampling iterations. Default
1000.- chains
Number of MCMC chains. Default
2.- seed
Random seed. Default
123.- refresh
Sampler progress refresh. Default
0.- include_flow_adj_draws
Logical; if
TRUE, attach posterior draw matrix as attribute"flow_adj_draws". These are the draw-level adjusted observed flows underlying the returnedflow_adjsummary. DefaultFALSE.- keep_cols
Optional extra columns from
mpd_od_dfto keep.
Value
A tibble with identifiers, original flow, adjusted
flow_adj, source row-status fields, bias_e_origin, and
log-distance helpers. In prediction_scope = "observed",
flow_adj is returned for observed rows in mpd_od_df. In
prediction_scope = "complete_grid", flow_adj is returned for
every valid row in the supplied square OD grid.
Attributes:
"model": fitted model object"formula": fitted formula"coefficients": fixed-effect summary table"model_engine":"bayesian"or"frequentist""backend": backend used"model_family": fitted family"model_terms": resolved fixed-effect and random-effect terms"stage": development stage label"stage_scope": short statement of scope"result_metadata": compact Stage-1 metadata bundle"random_intercept": grouping structure used"scenario": resolved S1-S4 input scenario"repeated_observation": resolved repeated-observation structure"flow_adj_summary": posterior summary used forflow_adj"distance_source": where distance came from"diagnostics": lightweight fit/convergence metadata when available"prototype_notes": stage notes
Details
Users can either supply a single reduced-form formula, or separate
mobility_formula and bias_formula terms. The split interface
keeps the conceptual Level-2 true-flow predictors separate from the Level-1
MPD observation-bias predictors.
In target_scale = "mpd_counterfactual" mode, the adjusted flow is
computed from a formula-aware fixed-effect counterfactual prediction with
the numeric variables in bias_formula set to zero, then summarizes
draw-level adjusted flows by mean or median.
In target_scale = "true_flow" mode, coverage is treated as an
observation-process offset. The fitted model is
$$\log E(F^{mpd}_{ij}) = \log(q_{ij}) + \eta^{true}_{ij},$$
where \(q_{ij}\) is derived from origin, destination, or geometric-mean
active-user coverage. The returned flow_adj is the estimated
true-flow scale \(\exp(\eta^{true}_{ij})\).
In observation_model = "latent_two_level" mode, the Bayesian path
fits an experimental joint latent model. It estimates a source-invariant
true-flow intensity for each OD or OD-time state, then models each MPD
source/time row as a coverage-scaled noisy observation of that latent state.
This model is most informative for repeated source structures (S3/S4) and
remains weakly identified for S1/S2 designs without strong priors or
external validation.
Examples
if (requireNamespace("lme4", quietly = TRUE)) {
areas <- c("A", "B", "C", "D")
mpd <- expand.grid(
origin = areas,
destination = areas,
KEEP.OUT.ATTRS = FALSE
)
mpd$flow <- c(12, 7, 4, 5, 6, 14, 8, 4, 5, 7, 11, 6, 4, 5, 7, 13)
coverage <- data.frame(
origin = areas,
population = c(100, 120, 90, 110),
user_count = c(20, 30, 18, 22)
)
covariates <- data.frame(
area = areas,
rural_pct = c(0.15, 0.35, 0.60, 0.80),
population = c(100, 120, 90, 110)
)
distance <- expand.grid(
origin = areas,
destination = areas,
KEEP.OUT.ATTRS = FALSE
)
distance$distance_km <- abs(
match(distance$origin, areas) - match(distance$destination, areas)
) + 1
adj <- suppressMessages(
suppressWarnings(
adjust_multilevel_bayes(
mpd_od_df = mpd,
coverage_df = coverage,
covariates_df = covariates,
distance_df = distance,
mobility_formula = ~ rural_pct_o + rural_pct_d + log_distance +
(1 | origin),
bias_formula = ~ bias_e_origin,
model_engine = "frequentist",
model_family = "poisson"
)
)
)
head(adj)
}
#> # A tibble: 6 × 17
#> origin destination mpd_observed mpd_zero_filled mpd_row_status
#> <chr> <chr> <lgl> <lgl> <chr>
#> 1 A A TRUE FALSE observed
#> 2 B A TRUE FALSE observed
#> 3 C A TRUE FALSE observed
#> 4 D A TRUE FALSE observed
#> 5 A B TRUE FALSE observed
#> 6 B B TRUE FALSE observed
#> # ℹ 12 more variables: prediction_scope <chr>, model_fit_status <chr>,
#> # flow <dbl>, flow_adj <dbl>, flow_mpd_pred <dbl>, flow_true_pred <dbl>,
#> # coverage_rate_o <dbl>, coverage_rate_d <dbl>, distance_km <dbl>,
#> # log_distance <dbl>, bias_e_origin <dbl>, log_dist_synth <dbl>