Bayesian Multilevel Bias Adjustment for OD Flows (v0.2 Stage 1)

Estimates bias-adjusted OD flows from MPD counts using a flexible Bayesian multilevel count model. The observed-OD mode corrects observed MPD flows. The complete-grid mode fits to originally observed source rows and predicts adjusted flows for every row in a supplied square OD grid.

Usage

adjust_multilevel_bayes(
  mpd_od_df,
  coverage_df,
  covariates_df = NULL,
  distance_df = NULL,
  flow_col = "flow",
  income_col = NULL,
  pop_col = "population",
  distance_col = "distance_km",
  source_col = NULL,
  time_col = NULL,
  scenario = c("auto", "s1", "s2", "s3", "s4"),
  repeated_observation = c("auto", "none", "time", "source", "source_time"),
  random_intercept = c("origin", "destination", "od", "source", "time", "source_time",
    "none"),
  formula = NULL,
  custom_formula = NULL,
  mobility_formula = NULL,
  bias_formula = NULL,
  model_family = c("poisson", "negbin", "zip", "zinb"),
  model_engine = c("bayesian", "frequentist"),
  backend = c("auto", "rstanarm", "brms", "stan_latent"),
  flow_adj_summary = c("mean", "median"),
  target_scale = c("mpd_counterfactual", "true_flow"),
  observation_model = c("reduced_form", "coverage_offset", "latent_two_level"),
  coverage_scale = c("origin", "destination", "both"),
  latent_flow_unit = c("auto", "od", "od_time"),
  latent_coef_prior_scale = 1,
  latent_bias_prior_scale = 0.5,
  latent_intercept_prior_scale = 2.5,
  latent_state_prior_scale = 0.75,
  latent_source_prior_scale = 0.5,
  latent_time_prior_scale = 0.5,
  latent_phi_prior_rate = 1,
  latent_adapt_delta = 0.95,
  latent_max_treedepth = 12,
  latent_rng_eta_max = 20,
  prediction_scope = c("observed", "complete_grid"),
  iter = 1000,
  chains = 2,
  seed = 123,
  refresh = 0,
  include_flow_adj_draws = FALSE,
  keep_cols = character()
)

Arguments

mpd_od_df: Data frame with at least origin, destination, and flow_col. Optional mpd_source is carried through.
coverage_df: Data frame with at least origin, population, user_count. Optional destination and mpd_source enable destination/source-specific joins.
covariates_df: Optional area-level covariates with at least area. Each additional column is made available to formula twice using origin and destination suffixes, for example rural_pct_o and rural_pct_d. If pop_col exists, it is used for population covariates; otherwise populations are derived from coverage_df. Set to NULL when formula uses only OD, distance, coverage, source, or time fields.
distance_df: Optional OD distance data frame with origin, destination, and distance_col.
flow_col: Name of MPD flow column in mpd_od_df. Default "flow".
income_col: Deprecated. Optional name of the area-level covariate to use in the default formula when formula = NULL. Prefer specifying covariates directly in formula.
pop_col: Optional population column in covariates_df. Default "population". If missing, a population proxy is built from coverage_df.
distance_col: Name of distance column in distance_df (or in mpd_od_df if distance_df = NULL). Default "distance_km".
source_col: Optional mobile-phone data source column in mpd_od_df. If NULL and mpd_source exists, that column is used. Source metadata supports S3 and S4 inputs.
time_col: Optional observation-period column in mpd_od_df. Time metadata supports S2 and S4 inputs.
scenario: Input scenario contract: "auto", "s1", "s2", "s3", or "s4". S1 is single source/single time, S2 is single source/multiple times, S3 is multiple sources/single time, and S4 is multiple sources/multiple times. "auto" infers the scenario from the supplied source and time columns.
repeated_observation: Repeated-observation structure: "auto", "none", "time", "source", or "source_time". "auto" maps from scenario.
random_intercept: Random-intercept structure: "origin", "destination", "od", "source", "time", "source_time", or "none". Used only when formula is not supplied; formula random-effect terms such as (1 | origin) or (1 + log_distance | origin) take precedence.
formula: Optional model formula using the prepared model data. The response should usually be flow. The formula can include any prepared covariate columns, log_distance, bias_e_origin, scenario columns such as mpd_source and mpd_time, and random-effect terms supported by the chosen engine.
custom_formula: Deprecated alias for formula.
mobility_formula: Optional one-sided formula for the conceptual true-flow component, for example ~ rural_pct_o + rural_pct_d + log_distance + (1 | origin). Supply together with bias_formula; do not combine with formula or custom_formula.
bias_formula: Optional one-sided formula for the conceptual MPD observation-bias component, for example ~ bias_e_origin. Supply together with mobility_formula. Fixed-effect variables in this formula must be numeric, integer, or logical because the adjusted counterfactual sets them to zero.
model_family: Count family: "poisson", "negbin", "zip", or "zinb".
model_engine: Fitting engine. "bayesian" uses the Bayesian backend for S1-S4 source/time scenarios; "frequentist" uses a faster Poisson/negative-binomial GLM/GLMM scaffold for testing, experimentation, and runtime-sensitive comparisons.
backend: Bayesian backend: "auto", "rstanarm", "brms", or "stan_latent". "auto" chooses rstanarm for Poisson/NegBin reduced-form and coverage-offset models, brms for zero-inflated families, and the package's custom Stan backend for observation_model = "latent_two_level".
flow_adj_summary: Summary for posterior draw-level adjusted flows: "mean" or "median". This controls how the draw-level adjusted flows are collapsed into the returned flow_adj column.
target_scale: Adjustment target. "mpd_counterfactual" preserves the reduced-form MPD-scale counterfactual. "true_flow" estimates a true-flow scale by using coverage as an observation-process offset.
observation_model: Observation model. "reduced_form" preserves the existing fitted-bias-covariate path. "coverage_offset" uses active-user coverage as a fixed offset and is required for target_scale = "true_flow". "latent_two_level" fits an advanced joint latent model with a source-invariant true-flow state and source/time-specific MPD observation process.
coverage_scale: Coverage rate used by "coverage_offset": "origin" uses $c_i$, "destination" uses $c_j$, and "both" uses $\sqrt{c_i c_j}$.
latent_flow_unit: Latent state used when observation_model = "latent_two_level". "od" shares one latent flow across source/time observations of an OD pair; "od_time" shares one latent flow across sources within each OD-time cell. "auto" chooses "od_time" for S2 and for S4 only when each OD-time latent state has repeated observed sources; otherwise it uses "od".
latent_coef_prior_scale, latent_bias_prior_scale: Positive prior scales used by the custom "stan_latent" backend for true-flow and observation-bias coefficients. These controls are ignored by "rstanarm" and "brms" backends.
latent_intercept_prior_scale: Positive prior scale for the true-flow intercept in the custom "stan_latent" backend.
latent_state_prior_scale, latent_source_prior_scale, latent_time_prior_scale: Positive regularization scales used by the custom "stan_latent" backend for centered latent OD/OD-time, source, and time effects.
latent_phi_prior_rate: Positive exponential prior rate for the negative-binomial overdispersion parameter in the custom "stan_latent" backend.
latent_adapt_delta: Target acceptance probability passed to rstan::sampling() for the custom "stan_latent" backend.
latent_max_treedepth: Maximum tree depth passed to rstan::sampling() for the custom "stan_latent" backend.
latent_rng_eta_max: Upper clamp for generated-quantity log-rate exponentiation and RNG calls in the custom "stan_latent" backend. The likelihood and log-likelihood are not clamped; this prevents overflow in returned posterior prediction draws and replicated-count RNGs.
prediction_scope: Prediction contract. "observed" preserves the original observed-row workflow. "complete_grid" requires a strict square OD grid and predicts for all valid rows, while fitting on rows marked as originally observed when mpd_observed is present.
iter: Number of sampling iterations. Default 1000.
chains: Number of MCMC chains. Default 2.
seed: Random seed. Default 123.
refresh: Sampler progress refresh. Default 0.
include_flow_adj_draws: Logical; if TRUE, attach posterior draw matrix as attribute "flow_adj_draws". These are the draw-level adjusted observed flows underlying the returned flow_adj summary. Default FALSE.
keep_cols: Optional extra columns from mpd_od_df to keep.

Value

A tibble with identifiers, original flow, adjusted flow_adj, source row-status fields, bias_e_origin, and log-distance helpers. In prediction_scope = "observed", flow_adj is returned for observed rows in mpd_od_df. In prediction_scope = "complete_grid", flow_adj is returned for every valid row in the supplied square OD grid. Attributes:

"model": fitted model object
"formula": fitted formula
"coefficients": fixed-effect summary table
"model_engine": "bayesian" or "frequentist"
"backend": backend used
"model_family": fitted family
"model_terms": resolved fixed-effect and random-effect terms
"stage": development stage label
"stage_scope": short statement of scope
"result_metadata": compact Stage-1 metadata bundle
"random_intercept": grouping structure used
"scenario": resolved S1-S4 input scenario
"repeated_observation": resolved repeated-observation structure
"flow_adj_summary": posterior summary used for flow_adj
"distance_source": where distance came from
"diagnostics": lightweight fit/convergence metadata when available
"prototype_notes": stage notes

Details

Users can either supply a single reduced-form formula, or separate mobility_formula and bias_formula terms. The split interface keeps the conceptual Level-2 true-flow predictors separate from the Level-1 MPD observation-bias predictors.

In target_scale = "mpd_counterfactual" mode, the adjusted flow is computed from a formula-aware fixed-effect counterfactual prediction with the numeric variables in bias_formula set to zero, then summarizes draw-level adjusted flows by mean or median.

In target_scale = "true_flow" mode, coverage is treated as an observation-process offset. The fitted model is $$\log E(F^{mpd}_{ij}) = \log(q_{ij}) + \eta^{true}_{ij},$$ where $q_{ij}$ is derived from origin, destination, or geometric-mean active-user coverage. The returned flow_adj is the estimated true-flow scale $\exp(\eta^{true}_{ij})$.

In observation_model = "latent_two_level" mode, the Bayesian path fits a joint latent model for repeated-source empirical workflows. It estimates a source-invariant true-flow intensity for each OD or OD-time state, then models each MPD source/time row as a coverage-scaled noisy observation of that latent state. This model is most informative for repeated source structures (S3/S4) and remains weakly identified for S1/S2 designs without strong priors or external validation.

Examples

if (requireNamespace("lme4", quietly = TRUE)) {
  areas <- c("A", "B", "C", "D")
  mpd <- expand.grid(
    origin = areas,
    destination = areas,
    KEEP.OUT.ATTRS = FALSE
  )
  mpd$flow <- c(12, 7, 4, 5, 6, 14, 8, 4, 5, 7, 11, 6, 4, 5, 7, 13)

  coverage <- data.frame(
    origin = areas,
    population = c(100, 120, 90, 110),
    user_count = c(20, 30, 18, 22)
  )

  covariates <- data.frame(
    area = areas,
    rural_pct = c(0.15, 0.35, 0.60, 0.80),
    population = c(100, 120, 90, 110)
  )

  distance <- expand.grid(
    origin = areas,
    destination = areas,
    KEEP.OUT.ATTRS = FALSE
  )
  distance$distance_km <- abs(
    match(distance$origin, areas) - match(distance$destination, areas)
  ) + 1

  adj <- suppressMessages(
    suppressWarnings(
      adjust_multilevel_bayes(
        mpd_od_df = mpd,
        coverage_df = coverage,
        covariates_df = covariates,
        distance_df = distance,
        mobility_formula = ~ rural_pct_o + rural_pct_d + log_distance +
          (1 | origin),
        bias_formula = ~ bias_e_origin,
        model_engine = "frequentist",
        model_family = "poisson"
      )
    )
  )

  head(adj)
}
#> # A tibble: 6 × 17
#>   origin destination mpd_observed mpd_zero_filled mpd_row_status
#>   <chr>  <chr>       <lgl>        <lgl>           <chr>         
#> 1 A      A           TRUE         FALSE           observed      
#> 2 B      A           TRUE         FALSE           observed      
#> 3 C      A           TRUE         FALSE           observed      
#> 4 D      A           TRUE         FALSE           observed      
#> 5 A      B           TRUE         FALSE           observed      
#> 6 B      B           TRUE         FALSE           observed      
#> # ℹ 12 more variables: prediction_scope <chr>, model_fit_status <chr>,
#> #   flow <dbl>, flow_adj <dbl>, flow_mpd_pred <dbl>, flow_true_pred <dbl>,
#> #   coverage_rate_o <dbl>, coverage_rate_d <dbl>, distance_km <dbl>,
#> #   log_distance <dbl>, bias_e_origin <dbl>, log_dist_synth <dbl>