Skip to contents

Aim

This vignette describes the set of adjustment methods included in debiasR, to correct mobile-phone-derived origin-destination mobility counts.

The vignette describes how each adjustment method works, associated functions and parameters, and how to implement it. All the methods in debiasR use mobile-phone-derived flows as input data and return an adjusted flow flow_adj. However, they differ in fundamental ways, in terms of data input requirements, how they derive adjustment weights and the extent they can adjust biases in the data. These differences are described in more detail below. The practical choice of which methods to use should be guided by what data are available and validation against a credible benchmark. A key innovation of debiasR is the introduction of a Bayesian multilevel modelling approach which we also explain below.

Intuition

Let us quickly recall the challenge we face when working with mobile-phone-derived mobility flows. These flows are typically represent an small proportion of the resident population, hence we end up under-counting these populations. Though we can also over-count populations if individuals have multiple mobile devices and they are recorded in the data. So the idea of the methods included in debiasR is to adjust these counts to make these counts more representative of the actual observed flow. Formally, let FijmpdF^{mpd}_{ij} denote the observed mobile-phone-derived flow from origin ii to destination jj, and let FijadjF^{adj}_{ij} denote the adjusted flow. If a method treats the observed MPD flow as under-counted, it increases FijadjF^{adj}_{ij}. If it treats the observed MPD flow as over-counted, it reduces FijadjF^{adj}_{ij}.

Where, let us recall:

Symbol Description
FijmpdF^{mpd}_{ij} Observed MPD flow from origin ii to destination jj.
FijbenchF^{bench}_{ij} Benchmark flow from origin ii to destination jj.
FijadjF^{adj}_{ij} Adjusted flow returned by an adjustment method.
Pi,PjP_i, P_j Population at origin and destination.
Ui,UjU_i, U_j Active-user count at origin and destination.
ci=Ui/Pic_i = U_i/P_i Active-user coverage rate.
bi=1cib_i = 1 - c_i Active-user coverage bias.
dijd_{ij} Distance between origin ii and destination jj.
Xi,XjX_i, X_j Origin and destination covariates, such as rural share.

Methods in debiasR

This vignette covers the bias adjustment methods currently available in debiasR. The table below illustrates how they map onto debiasR functions and relevant references that have inspire our thinking. Below we explain how each of these methods adjust biases and how you can implement them via debiasR.

Method Function Reference
Inverse Penetration Rate Weights debiasR::adjust_inverse_penetration() Chi et al. (2025)
Selection Rate debiasR::adjust_selection_rate() Chi et al. (2025)
Selection Rate II debiasR::adjust_selection_rate2() Zagheni and Weber (2012)
Raking Ratio (or Iterative proportional fitting) debiasR::adjust_raking_ratio() Chi et al. (2025)
Coefficient regression modelling debiasR::adjust_coefficient() Chi et al. (2025)
Bayesian multilevel modelling debiasR::adjust_multilevel_bayes() Rowe and Cabrera (in progress); Raymer et al. (2013); Aparicio Castro et al. (2024)

Data

We use the same empirical local authority district (LAD) travel-to-work example introduced in the earlier vignettes. The observed origin-destination flow table comes from Locomizer-derived mobile phone data and the benchmark origin-destination flow table comes from the 2021 UK Census. The data are provided through the optional debiasRdata companion package and described in more detail in the Data vignette.

The README shows how to install debiasR and the companion debiasRdata package from GitHub. We start by loading those packages.

First, we identify the main LAD data objects directly from debiasRdata.

OD_travel2work <- debiasRdata::lad_OD_travel2work
census_OD_travel2work <- debiasRdata::census_lad_OD_travel2work
coverage <- debiasRdata::coverage_lad |>
  dplyr::rename(area = code)
covariates <- debiasRdata::lad_covariates
centroids <- debiasRdata::lad_centroids
head(OD_travel2work)
     origin destination flow
1 E06000001   E06000001  854
2 E06000001   E06000002   33
3 E06000001   E06000003   14
4 E06000001   E06000004   95
5 E06000001   E06000005   15
6 E06000001   E06000009    1
head(census_OD_travel2work)
     origin destination  flow
1 E06000001   E06000001 14513
2 E06000001   E06000002  1500
3 E06000001   E06000003   671
4 E06000001   E06000004  4240
5 E06000001   E06000005   398
6 E06000001   E06000007     2
head(coverage)
  date                 name      area population user_count
1 2021           Hartlepool E06000001      92338       1207
2 2021        Middlesbrough E06000002     143926       1028
3 2021 Redcar and Cleveland E06000003     136531       1144
4 2021     Stockton-on-Tees E06000004     196595       1755
5 2021           Darlington E06000005     107799       1165
6 2021               Halton E06000006     128478       1744

For the adjustment examples, we use debiasR::debiasR_example_data() to prepare a small, aligned version of these data. This keeps the vignette fast to render and creates the complete-grid OD structure and distance table used by some methods below.

example_data <- debiasR::debiasR_example_data(
  n_areas = 25,
  complete_grid = TRUE
)

mpd_od <- example_data$mpd_od
benchmark_od <- example_data$benchmark_od
coverage_adjustment <- example_data$coverage
covariates_adjustment <- example_data$covariates
distance <- example_data$distance

The method examples below use a common comparison view so the reported flow columns have the same meaning throughout the vignette:

  • flow_mpd_raw: the raw, unadjusted MPD flow used as input.
  • flow_mpd_adjusted: the adjusted MPD-derived estimate returned by the adjustment method.
  • flow_benchmark: the observed Census benchmark flow, shown for interpretation and validation.
benchmark_flow_lookup <- benchmark_od |>
  dplyr::select(
    origin,
    destination,
    flow_benchmark = flow
  )

show_adjustment_comparison <- function(adjusted_df,
                                       extra_cols = character(),
                                       n = 5) {
  adjusted_df |>
    dplyr::left_join(
      benchmark_flow_lookup,
      by = c("origin", "destination")
    ) |>
    dplyr::select(
      origin,
      destination,
      flow_mpd_raw = flow,
      flow_mpd_adjusted = flow_adj,
      flow_benchmark,
      dplyr::any_of(extra_cols)
    ) |>
    dplyr::slice_head(n = n)
}

Inverse penetration rate

The inverse penetration rate adjusts origin-destination flows by scaling observed mobile-phone-derived flows according to how well the population is represented in the mobile-phone data. For each area, we calculate the active-user coverage rate (cic_i) as the number of observed active users (UiU_i) divided by the benchmark population (PiP_i). The inverse of this rate becomes the adjustment weight (wiw_i). Areas with lower coverage receive larger weights because their flows are assumed to be more undercounted. For example, if only 5% of residents in an origin area appear in the mobile-phone data, flows from that origin can be scaled up by a factor of 20; that is 1/0.05=201 / 0.05 = 20, so each observed flow is treated as representing about twenty times as many people in the benchmark population. The adjustment can use origin coverage, destination coverage or a combination of both. This makes the method simple and transparent, but it assumes that bias is mainly proportional to population coverage and does not directly account for differences in who is represented among mobile-phone users. Formally:

Penetration rate and inverse penetration rate weight:

ci=UiPi,wi=1ci=PiUi c_i = \frac{U_i}{P_i}, \qquad w_i = \frac{1}{c_i} = \frac{P_i}{U_i}

Origin-weighted adjustment:

Fijadj=wiFijmpd F^{adj}_{ij} = w_i F^{mpd}_{ij}

Destination-weighted adjustment:

Fijadj=wjFijmpd F^{adj}_{ij} = w_j F^{mpd}_{ij}

Both-side adjustment:

Fijadj=wiwjFijmpd F^{adj}_{ij} = \sqrt{w_iw_j}F^{mpd}_{ij}

Implementation

To use debiasR::adjust_inverse_penetration(), we input our observed mobile-phone-derived origin-destination flows via mpd_od_df and coverage data to coverage_df in tabular format. The coverage table should contain the benchmark population and active-user counts needed to calculate how strongly each area is underrepresented. The weight_by argument lets you choose where the correction is applied: “origin” adjusts flows using coverage at the origin, “destination” uses coverage at the destination and “both” combines both sides. The result is a data frame that preserves the original observed flow in flow and adds flow_adj, the adjusted flow after inverse penetration weighting.

If we implement inverse penetration weighting adjusting by origins only, we set weight_by to “origin” as done below. The resulting table compares the original observed MPD flow, the inverse-penetration-adjusted estimate and the Census benchmark flow. The difference between the adjusted and benchmark columns gives a quick descriptive view of the adjustment, though we assess it formally in the validation vignette.

adj_inverse_penetration <- debiasR::adjust_inverse_penetration(
  mpd_od_df = mpd_od,
  coverage_df = coverage_adjustment,
  weight_by = "origin"
)

show_adjustment_comparison(adj_inverse_penetration)
# A tibble: 5 × 5
  origin    destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
  <chr>     <chr>              <dbl>             <dbl>          <dbl>
1 E06000011 E06000011           2403           53758.           57155
2 E06000011 E06000016              1              22.4             10
3 E06000011 E06000018              1              22.4             15
4 E06000011 E06000023              2              44.7              3
5 E06000011 E06000047              6             134.              11

This method is a useful first adjustment when the main known source of bias is uneven population coverage across areas. It is especially helpful when we have reliable benchmark population counts and active-user counts, but no trusted benchmark origin-destination flow table. In that setting, inverse penetration weighting provides a transparent way to rescale observed MPD flows so that areas with lower mobile-phone coverage contribute more strongly to the adjusted flow estimates. Its main limitation is that it treats coverage as the correction mechanism. It assumes that the observed users in an area are broadly representative of the residents who are missing from the data. It therefore cannot directly account for social, demographic, behavioural or spatial differences in who appears in the MPD source. If mobile-phone users are systematically different from non-users, or if the relationship between coverage and mobility varies across places, this method may reduce undercounting while still leaving important selection bias in the adjusted flows.

Selection rate

Selection rate adjusts origin-destination flows by allowing the adjustment weight to depend on both population coverage and a covariate that helps explain who appears in the mobile-phone-derived data. The intuition is that two areas can have the same active-user coverage rate but still differ in the kinds of people represented in the MPD source. The selection-rate method therefore adds an area characteristic, such as rural share, to the coverage correction. It can also use benchmark OD flows to calibrate the selection parameter, so that the adjusted flows are closer to a trusted benchmark when one is available. This makes the method more flexible than inverse penetration weighting, but it also makes the result more dependent on the quality of the covariate and the calibration choice. Formally:

Origin-side selection weight:

wi(O)(rt)=1Iici(O)+(1Ii)rt w_i^{(O)}(r_t) = \frac{1}{I_i c_i^{(O)} + (1 - I_i) r_t}

Destination-side selection weight:

wj(D)(rt)=1Ijcj(D)+(1Ij)rt w_j^{(D)}(r_t) = \frac{1}{I_j c_j^{(D)} + (1 - I_j) r_t}

Adjusted flow:

Fijadj=Fijmpdwij F^{adj}_{ij} = F^{mpd}_{ij}w_{ij}

where IiI_i and IjI_j are covariate values; rtr_t is the selection-rate parameter; and, wijw_{ij} is the origin, destination or both-side correction.

Implementation

To use debiasR::adjust_selection_rate(), we again input the observed MPD origin-destination flows through mpd_od_df and the population coverage table through coverage_df. The covariates_df and covariate_col arguments tell the function which area-level characteristic to use when estimating differential selection. In this example, we use rural_pct, the rural land-share covariate from debiasRdata. While all other parameters in debiasR::adjust_selection_rate() have defaults, we recommend users to include specify benchmark_od_df for a meaningful selection-rate adjustment. This data frame corresponds to a benchmark OD table used to calibrate the selection rate parameter r in the equations above.

What about if you do not have benchmark flow?

debiasR::adjust_selection_rate() has two no-benchmark routes:

  1. you can provide r_global directly based on prior evidence, substantive knowledge or a sensitivity-analysis design.

  2. if both benchmark_od_df and r_global are omitted, debiasR uses a fallback based on the overall active-user coverage rate sum(U) / sum(P). This makes the method executable, but the result should be interpreted as an exploratory adjustment rather than a calibrated estimate.

Our recommendation is to calibrate with a credible benchmark_od_df whenever one is available. If no benchmark exists, we recommend treating r_global as a sensitivity parameter. Report the fallback value, test a small set of plausible alternatives and validate the adjusted flows against any aggregate or external evidence that is available.

To calibrate MPD flows, calibration_aggregateis set to origin. This means that MDP flows are calibrated against origin-level benchmark totals. This function tries each value in a grid of r values (r_grid), adjusts the MPD flows, sums adjusted and benchmark flows by origin, and chooses the r_t that minimises the absolute difference in origin totals. r_grid is the set of candidate values that debiasR::adjust_selection_rate() tries when it needs to calibrate the global selection parameter r_t. If calibration_aggregate is set to od, it skips that origin aggregation. Instead, for each candidate r_t, it compares each adjusted OD cell directly with the benchmark OD cell (as described in the note below), and picks the r_t with the smallest total OD-level error. The weight_by argument controls whether the correction is applied using origin coverage, destination coverage or both.

adj_selection_rate <- debiasR::adjust_selection_rate(
  mpd_od_df = mpd_od,
  coverage_df = coverage_adjustment,
  covariates_df = covariates_adjustment,
  covariate_col = "rural_pct",
  weight_by = "origin",
  benchmark_od_df = benchmark_od,
  calibration_aggregate = "origin"
)

show_adjustment_comparison(adj_selection_rate)
# A tibble: 5 × 5
  origin    destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
  <chr>     <chr>              <dbl>             <dbl>          <dbl>
1 E06000011 E06000011           2403           54984.           57155
2 E06000011 E06000016              1              22.9             10
3 E06000011 E06000018              1              22.9             15
4 E06000011 E06000023              2              45.8              3
5 E06000011 E06000047              6             137.              11
tibble::tibble(r_global = attr(adj_selection_rate, 
                               "r_global"))
# A tibble: 1 × 1
  r_global
     <dbl>
1     0.04

The comparison table reports the original MPD flow, the selection-rate adjusted estimate and the Census benchmark flow. Because this example uses weight_by = "origin", all flows leaving a given origin are adjusted using that origin’s coverage and selected covariate value. The reported r_global attribute is the calibrated selection-rate parameter chosen by the function. This method is useful when coverage alone is too simple and we believe representation varies with an area characteristic. Its main limitation is that the chosen covariate must be meaningful: if the covariate is weakly related to selection into the MPD source, the adjustment may add complexity without reducing bias. The weight_missing column in the adj_selection_rate is a diagnostic flag if missing and / or invalid origin coverage data leads to missing and / or invalid weight inputs.

Selection rate II

Selection rate II adjusts origin-destination flows by applying a compact coverage correction controlled by a single curvature parameter, kk. The method keeps the focus on population coverage, but instead of using the simple inverse of the coverage rate, it transforms the coverage rate through a nonlinear correction factor. This gives users a middle ground between inverse penetration weighting and the fuller covariate-based selection-rate model. When benchmark OD flows are available, debiasR can calibrate kk so that the adjusted flows better match a trusted comparison. Formally:

Coverage correction factor:

CF(c;k)=cexp(k)1exp(kc)1 CF(c; k) = c\frac{\exp(-k) - 1}{\exp(-kc) - 1}

where CF(c;k)CF(c; k) is the coverage correction factor, cc is the active-user coverage rate for the selected origin or destination area, and kk is the curvature parameter that controls how strongly coverage is transformed.

Adjusted flow:

Fijadj=FijmpdCF(c;k) F^{adj}_{ij} = F^{mpd}_{ij}CF(c; k)

The same correction can be applied on the origin side, destination side, or both sides depending on weight_by.

Implementation

To use debiasR::adjust_selection_rate2(), the strict minimum set of parameter to be specified are MPD flows and coverage data. Input the observed MPD flows to mpd_od_df and the coverage table to coverage_df. The weight_by argument controls whether the correction uses origin coverage, destination coverage or both. You can set k directly if you have a chosen correction curvature, or leave it as NULL and provide benchmark_od_df so the function calibrates k over k_grid. As above, calibration_aggregate = "origin" calibrates against origin-level benchmark totals. The parameter k is the analog to r in debiasR::adjust_selection_rate().

adj_selection_rate2 <- debiasR::adjust_selection_rate2(
  mpd_od_df = mpd_od,
  coverage_df = coverage_adjustment,
  weight_by = "origin",
  benchmark_od_df = benchmark_od,
  calibration_aggregate = "origin"
)

show_adjustment_comparison(adj_selection_rate2)
# A tibble: 5 × 5
  origin    destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
  <chr>     <chr>              <dbl>             <dbl>          <dbl>
1 E06000011 E06000011           2403          2292.             57155
2 E06000011 E06000016              1             0.954             10
3 E06000011 E06000018              1             0.954             15
4 E06000011 E06000023              2             1.91               3
5 E06000011 E06000047              6             5.72              11
tibble::tibble(k = attr(adj_selection_rate2, "k"))
# A tibble: 1 × 1
      k
  <dbl>
1   0.1

The comparison table reports the observed MPD count, the adjusted flow and the Census benchmark flow. In this origin-weighted example, the correction applied to a flow depends on the origin area’s coverage rate and the calibrated kk value. The reported k attribute shows the curvature parameter used by the adjustment. This method is useful when coverage is the dominant concern and a compact correction is adequate. Its main limitation is that one parameter may be too simple when selection bias varies strongly across places or population groups.

Raking ratio

Raking ratio is also known as iterative proportional fitting (IPF). Raking ratio adjusts origin-destination flows by iteratively rescaling the observed MPD matrix so that its origin and destination totals match trusted benchmark margins. The intuition is different from coverage weighting. Instead of calculating a weight from active-user coverage, the method asks the adjusted OD table to satisfy known aggregate constraints. If we trust total outflows from each origin and total inflows to each destination, IPF updates the OD cells until the row and column totals agree with those benchmarks. Formally:

Margin constraints:

jFijadj=Oibench,iFijadj=Djbench \sum_j F^{adj}_{ij} = O_i^{bench}, \qquad \sum_i F^{adj}_{ij} = D_j^{bench}

where OibenchO_i^{bench} and DjbenchD_j^{bench} are benchmark outflow and inflow totals.

Implementation

To use debiasR::adjust_raking_ratio(), the minimum set of parameter to specify is the MDP flows and set of constraints. Input the observed MPD flows to mpd_od_df and either provide benchmark_od_df or supply explicit origin_targets and destination_targets. In this vignette, we use benchmark_od_df, so debiasR derives the target origin and destination margins from the benchmark OD table.

The max_iter and tol arguments control the iterative fitting process: max_iter sets the maximum number of update cycles and tol sets the convergence tolerance.

adj_raking_ratio <- debiasR::adjust_raking_ratio(
  mpd_od_df = mpd_od,
  benchmark_od_df = benchmark_od
)

show_adjustment_comparison(adj_raking_ratio)
# A tibble: 5 × 5
  origin    destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
  <chr>     <chr>              <dbl>             <dbl>          <dbl>
1 E06000011 E06000011           2403           52864.           57155
2 E06000011 E06000016              1              26.3             10
3 E06000011 E06000018              1              20.4             15
4 E06000011 E06000023              2              57.2              3
5 E06000011 E06000047              6             137.              11
adj_raking_ratio |>
  dplyr::summarise(
    raw_total = sum(flow, na.rm = TRUE),
    adjusted_total = sum(flow_adj, na.rm = TRUE),
    median_weight = median(weight_ipf, na.rm = TRUE),
    ipf_converged = attr(adj_raking_ratio, "ipf_converged"),
    ipf_iterations = attr(adj_raking_ratio, "ipf_iterations")
  )
# A tibble: 1 × 5
  raw_total adjusted_total median_weight ipf_converged ipf_iterations
      <dbl>          <dbl>         <dbl> <lgl>                  <int>
1     80695        2302435          27.9 TRUE                     172

The first table shows how individual observed MPD flows have been converted into adjusted flows and how those adjusted flows compare with the Census benchmark. The summary reports whether the IPF routine converged, how many iterations it used, and how much the total flow changed after adjustment. Raking is useful when trusted marginal totals are available and consistency with those totals is a priority. Its main limitation is that matching origin and destination margins does not guarantee that the internal allocation between specific origin-destination pairs is correct, so it should still be validated at the OD-pair level.

Coefficient regression modelling

Coefficient regression modelling adjusts origin-destination flows by learning a direct empirical relationship between observed MPD flows and benchmark flows. The method is useful when a trusted benchmark OD table is available for the same origin-destination pairs. Instead of using population coverage or margins, it fits a regression model that estimates how MPD counts map onto benchmark counts, then applies that fitted relationship to the MPD table. In its simplest form, this is a single multiplicative coefficient. In count-model versions, the relationship is estimated through a log link. Flow measured as proportions or probability as, for example, the number of outflows over the population at risk could be used as input, as well. Formally:

Core proportional form:

E(Fijbench)=βFijmpd E(F^{bench}_{ij}) = \beta F^{mpd}_{ij}

Count-model form:

log(μij)=α+log(Fijmpd) \log(\mu_{ij}) = \alpha + \log(F^{mpd}_{ij})

so that:

μij=eαFijmpd,β=eα \mu_{ij} = e^{\alpha}F^{mpd}_{ij}, \qquad \beta = e^{\alpha}

Implementation

To use debiasR::adjust_coefficient(), the minimum input requirement are the MPD and benchmark flows. Input the observed MPD flows to mpd_od_df and a benchmark OD table to benchmark_od_df. The model_family argument controls the regression family used to learn the MPD-to-benchmark relationship; this example uses "ols" for a simple proportional calibration. The level argument controls whether the coefficient is estimated at the OD, origin or destination level. Additional options, such as fit_intercept and by_source, let users decide whether to estimate an intercept or source-specific coefficients.

adj_coefficient <- debiasR::adjust_coefficient(
  mpd_od_df = mpd_od,
  benchmark_od_df = benchmark_od,
  model_family = "ols",
  level = "od"
)

show_adjustment_comparison(adj_coefficient)
# A tibble: 5 × 5
  origin    destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
  <chr>     <chr>              <dbl>             <dbl>          <dbl>
1 E06000011 E06000011           2403           69441.           57155
2 E06000011 E06000016              1              28.9             10
3 E06000011 E06000018              1              28.9             15
4 E06000011 E06000023              2              57.8              3
5 E06000011 E06000047              6             173.              11
attr(adj_coefficient, "coef")
[1] 28.89774

The comparison table shows the original observed MPD flow, the regression adjusted estimate and the Census benchmark flow. The coefficient attribute reports the fitted scaling relationship used to convert MPD flows into benchmark-like flows. This method is direct and easy to validate because the benchmark is part of the model-fitting process. Its main limitation is that benchmark quality is critical, and a simple coefficient can miss local heterogeneity unless the chosen model family and estimation level capture the relevant structure.

Bayesian multilevel modelling

Bayesian multilevel modelling adjusts origin-destination flows by fitting a structured count model to the observed MPD flows and estimating the underlying true-flow scale. This vignette focuses on the recommended default implementation: the coverage-offset true-flow model. This mode treats MPD flows as coverage-scaled observations of true flows, rather than as flows that are only adjusted after fitting by switching off a bias covariate. The multilevel structure can borrow information across origins, destinations and corridors when those data structures are available. The Bayesian implementation can take time, so the vignette shows the model code but renders a precomputed example output. For analysis, run enough iterations and chains for your data and check the sampler diagnostics before interpreting the posterior summaries.

adjust_multilevel_bayes() also includes advanced Bayesian modes for repeated source/time data, backward compatibility and fast structure checking. The guide below gives the practical choice points, while the advanced Bayesian adjustment vignette keeps the deeper formulas and diagnostic detail.

A practical guide is:

Option Use it when Returned scale What it returns
observation_model = "coverage_offset" You want the default Bayesian true-flow adjustment. True-flow scale MPD counts are treated as coverage-scaled observations of true flows. Start here.
observation_model = "reduced_form" You need backward compatibility or an MPD-scale sensitivity check. MPD-scale counterfactual The model fits observed MPD flows directly and predicts a zero-bias counterfactual.
observation_model = "latent_two_level" You have repeated source/time observations, especially S3 or S4 structures. Latent true-flow scale An experimental Stan backend estimates a shared latent OD or OD-time true-flow state.
model_engine = "frequentist" You want to test S1-S4 data structure quickly before Bayesian sampling. Fast non-posterior check This is a fast design/checking engine, not posterior inference.

For routine adjustment, the practical default is:

Target Recommended setting
Adjustment scale target_scale = "true_flow"
Observation model observation_model = "coverage_offset"
Coverage offset coverage_scale = "origin", "destination" or "both"

In coverage-offset true-flow mode, the model can be written as:

Level 1: MPD observation model

FijmpdPoisson(qijFijtrue) F^{mpd}_{ij} \sim \text{Poisson}(q_{ij} F^{true}_{ij})

logE(Fijmpd)=log(qij)+ηijtrue \log E(F^{mpd}_{ij}) = \log(q_{ij}) + \eta^{true}_{ij}

where qijq_{ij} is the observation probability derived from active-user coverage. With coverage_scale = "origin", qij=ci=Ui/Piq_{ij} = c_i = U_i/P_i. The term log(qij)\log(q_{ij}) is an offset: it fixes the observation-process scale and is not estimated as a regression coefficient.

Level 2: true mobility-flow model

ηijtrue=α+βoXi+βdXj+γlogdij+ui+vj+wij+ξij \eta^{true}_{ij} = \alpha + \beta_o X_i + \beta_d X_j + \gamma \log d_{ij} + u_i + v_j + w_{ij} + \xi_{ij}

where XiX_i and XjX_j are origin and destination covariates, dijd_{ij} is the origin-destination distance, and uiu_i, vjv_j and wijw_{ij} capture origin, destination and origin-destination structure.

Intuitively, Level 1 recognises that we observe MPD flows, not the true mobility flows themselves. The observed MPD flow FijmpdF^{mpd}_{ij} is treated as a coverage-scaled measurement of an underlying true flow FijtrueF^{true}_{ij}. If an origin has 20% active-user coverage, the fitted MPD-scale expectation is the true-flow expectation multiplied by 0.20 under the origin-offset assumption. For example, if the model estimates 500 true trips from origin ii to destination jj and origin ii has 20% active-user coverage, the corresponding MPD observation-scale prediction is 0.20×500=1000.20 \times 500 = 100 trips.

Level 2 describes the underlying mobility pattern learned from covariates, distance and origin/destination/corridor structure. It assumes that the true flow FijtrueF^{true}_{ij} depends on origin and destination characteristics, distance, and persistent origin, destination and corridor structure. This gives the model a way to borrow information across related places instead of treating every OD corridor as a completely separate unit.

FijtrueF^{true}_{ij} is not itself a random intercept. It is the estimated true-flow scale for a specific origin-destination pair, computed from the full true-flow linear predictor ηijtrue\eta^{true}_{ij}. Random intercepts, when included, are components of that predictor. An origin random intercept uiu_i shifts all flows leaving origin ii up or down; a destination random intercept vjv_j shifts flows arriving at destination jj; and an OD random intercept wijw_{ij} captures a corridor-specific deviation. In the small S1 example below we set random_intercept = "none" because each corridor appears once, so FijtrueF^{true}_{ij} is estimated from the fixed-effect predictors and the coverage offset rather than from random intercepts.

The model adjusts flows by predicting on two scales. The fitted MPD-scale prediction keeps the coverage offset:

Fijmpd,pred=exp(log(qij)+ηijtrue) F^{mpd,pred}_{ij} = \exp(\log(q_{ij}) + \eta^{true}_{ij})

The adjusted true-flow prediction removes the observation offset:

Fijadj=Fijtrue,pred=exp(ηijtrue) F^{adj}_{ij} = F^{true,pred}_{ij} = \exp(\eta^{true}_{ij})

In a Bayesian framework, these predictions are computed draw-by-draw from the posterior and summarised by mean or median. This means flow_adj is a posterior summary of the true-flow prediction, not a single deterministic rescaling of the observed MPD count.

Implementation

To use debiasR::adjust_multilevel_bayes(), input the observed MPD flows to mpd_od_df, the population coverage table to coverage_df, area-level covariates to covariates_df and OD distances to distance_df. The compact default example below uses a single source and a single observation period. The prediction_scope argument controls whether the method adjusts only observed MPD pairs or predicts over a supplied complete OD grid.

Input What it supplies
mpd_od_df Observed MPD origin-destination counts.
coverage_df Active-user coverage rates used to build the observation offset.
covariates_df Origin and destination predictors for the true-flow model.
distance_df OD distances or distance-derived predictors.

The equations above translate into two conceptual formula components. mobility_formula defines the true-flow model; here we use rural_pct_o, rural_pct_d and log_distance to represent origin covariates, destination covariates and distance. In the recommended default mode, the observation model is selected with target_scale = "true_flow" and observation_model = "coverage_offset", while coverage_scale selects whether the offset uses origin, destination or geometric-mean coverage. The coverage component is represented by this fixed offset rather than by an estimated coverage-bias coefficient in the fitted true-flow formula.

The default coverage-offset model is the safest Bayesian starting point because it makes the observation process explicit while keeping the fitted structure readable. adjust_multilevel_bayes() does not need benchmark OD flows to fit this model. In this vignette, benchmark flows are shown next to the raw and adjusted MPD-derived estimates so that readers can interpret and validate the adjustment.

The example below therefore uses a Bayesian fixed-effect true-flow model with an origin coverage offset. The model code is shown for reproducibility, while the rendered table is loaded from a precomputed package artifact so routine vignette updates do not rerun MCMC.

mpd_s1 <- mpd_od |>
  dplyr::mutate(
    mpd_source = "operator_a",
    mpd_time = "2021_q1"
  )

coverage_s1 <- coverage_adjustment |>
  dplyr::mutate(
    mpd_source = "operator_a",
    mpd_time = "2021_q1"
  )

adj_multilevel <- debiasR::adjust_multilevel_bayes(
  mpd_od_df = mpd_s1,
  coverage_df = coverage_s1,
  covariates_df = covariates_adjustment,
  distance_df = distance,
  mobility_formula = ~ rural_pct_o + rural_pct_d + log_distance,
  bias_formula = ~ bias_e_origin,
  target_scale = "true_flow",
  observation_model = "coverage_offset",
  coverage_scale = "origin",
  model_engine = "bayesian",
  scenario = "s1",
  source_col = "mpd_source",
  time_col = "mpd_time",
  repeated_observation = "none",
  prediction_scope = "complete_grid",
  random_intercept = "none",
  model_family = "poisson",
  flow_adj_summary = "median",
  include_flow_adj_draws = TRUE,
  iter = 1000,
  chains = 2,
  seed = 123,
  refresh = 0
)

The precomputed output below reports both posterior median and posterior mean summaries from this model fit.

origin destination flow_mpd_raw flow_mpd_adjusted_median flow_mpd_adjusted_mean flow_benchmark observation_probability flow_mpd_pred_median flow_mpd_pred_mean
E06000011 E06000011 2403 82289.89 82299.34 57155 0.04 3678.41 3678.83
E06000011 E06000016 1 506.20 506.70 10 0.04 22.63 22.65
E06000011 E06000018 1 523.64 524.08 15 0.04 23.41 23.43
E06000011 E06000023 2 461.64 462.10 3 0.04 20.64 20.66
E06000011 E06000047 6 508.61 508.49 11 0.04 22.73 22.73

The rendered table uses compact method-comparison names. The benchmark column is shown for interpretation and validation, but it is not used to fit this Bayesian model. The guide below maps the displayed columns to the names used in the object returned by adjust_multilevel_bayes().

Rendered column Returned object column Meaning
flow_mpd_raw flow Observed MPD count.
flow_mpd_adjusted_median flow_adj_median Posterior median of the adjusted true-flow estimate.
flow_mpd_adjusted_mean flow_adj_mean Posterior mean of the adjusted true-flow estimate.
flow_benchmark not used for fitting Benchmark OD flow shown for interpretation and validation.
observation_probability observation_probability Fixed coverage offset on the probability scale.
flow_mpd_pred_median flow_mpd_pred_median Posterior median of the fitted MPD observation-scale flow.
flow_mpd_pred_mean flow_mpd_pred_mean Posterior mean of the fitted MPD observation-scale flow.

The full returned object also includes flow_true_pred, flow_mpd_pred, and interval columns such as flow_adj_q2.5, flow_adj_q97.5, flow_mpd_pred_q2.5 and flow_mpd_pred_q97.5. The function call sets flow_adj_summary = "median", so the returned flow_adj equals flow_adj_median by default.

The saved example uses moderate sampler settings for a compact illustration. For applied analysis, run enough iterations and chains for your data and inspect the diagnostics.

Additional outputs and diagnostics

You can inspect the metadata and terms to confirm the scenario and formula fitted. For a live fit, also inspect sampler diagnostics before interpreting the posterior summaries.

attr(adj_multilevel, "result_metadata")
attr(adj_multilevel, "model_terms")
attr(adj_multilevel, "diagnostics")
field value
model_engine bayesian
backend rstanarm
target_scale true_flow
observation_model coverage_offset
coverage_scale origin
offset_column log_observation_probability
scenario s1
repeated_observation none
n_sources 1
n_time_periods 1
prediction_scope complete_grid
flow_adj_summary median
component value
formula flow ~ rural_pct_o + rural_pct_d + log_distance + offset(log_observation_probability)
formula_source split_formula
formula_interface split_true_flow
mobility_formula ~rural_pct_o + rural_pct_d + log_distance
bias_formula ~bias_e_origin
mobility_variables rural_pct_o, rural_pct_d, log_distance
bias_variables NA
user_formula TRUE
custom_formula FALSE
default_area_covariate rural_pct
default_fixed_effects NA
scenario_fixed_effects NA
formula_variables flow, rural_pct_o, rural_pct_d, log_distance, log_observation_probability
requested_random_intercept none
random_effect_term NA
formula_random_effects NA

The metadata report the engine, target scale, observation model, coverage offset, resolved scenario, repeated-observation structure, prediction scope and model terms used to fit the adjustment model. They also confirm that this example is using the S1 contract: one MPD source and one observation period. In a live returned object, flow stores the observed MPD count, flow_mpd_pred stores the fitted MPD-scale prediction, and flow_true_pred and flow_adj store the estimated true-flow adjustment. This method is useful when users need a structured model that can combine coverage, covariates, distance and hierarchical data structure. Its main limitation is that it is more data hungry than other methods and requires significantly more runtime, dependencies and diagnostic checking when fitted with the full Bayesian engine. The key strength is that the model does not require a benchmark dataset of flows.

More complex Bayesian structures

The example above uses the simplest S1 structure: one source and one observation period. adjust_multilevel_bayes() can also accept S2-S4 source/time structures through scenario, source_col, time_col and repeated_observation.

For repeated source/time data, the latest experimental option is observation_model = "latent_two_level". In this mode, the model estimates one shared latent true-flow state for an OD pair or OD-time group and treats each source/time MPD row as an observation of that state. This is most useful for S3 and S4 structures, where multiple MPD sources observe the same underlying mobility state. Source and time effects live in the observation layer, so flow_true_pred is source-invariant within a latent state, while flow_mpd_pred can vary by source and time.

adj_latent <- adjust_multilevel_bayes(
  mpd_od_df = mpd_repeated,
  coverage_df = coverage_repeated,
  covariates_df = covariates,
  distance_df = distance,
  mobility_formula = ~ rural_pct_o + rural_pct_d + log_distance,
  bias_formula = ~ bias_e_origin,
  target_scale = "true_flow",
  observation_model = "latent_two_level",
  backend = "stan_latent",
  model_engine = "bayesian",
  scenario = "s4",
  source_col = "mpd_source",
  time_col = "mpd_time",
  repeated_observation = "source_time",
  latent_flow_unit = "auto"
)

attr(adj_latent, "result_metadata")
attr(adj_latent, "diagnostics")

backend = "auto" also selects the latent Stan backend for this observation model.

This backend is still experimental. Use it after checking that the data really contain repeated source/time observations and after inspecting the diagnostics, including divergence rate, treedepth hit rate, acceptance summaries, R-hat and effective sample size and posterior predictive metadata where available. The advanced Bayesian adjustment vignette provides deeper detail on source-time structures, reduced-form compatibility and latent-backend diagnostics.

Fit all methods

We can also run all six adjustment methods at once on the same empirical LAD subset. The helper uses the frequentist multilevel engine by default here so the method-comparison table renders quickly; switch multilevel_engine to "bayesian" when you want posterior summaries and have time to inspect sampler diagnostics.

method_results <- fit_adjustment_methods(
  example_data,
  include_multilevel = TRUE,
  multilevel_engine = "frequentist"
)

method_engine <- function(x) {
  engine <- attr(x, "model_engine", exact = TRUE)
  if (is.null(engine)) "deterministic" else engine
}

tibble::tibble(
  method = names(method_results),
  engine = vapply(method_results, method_engine, character(1)),
  rows = vapply(method_results, nrow, integer(1)),
  adjusted_total = vapply(
    method_results,
    function(x) sum(x$flow_adj, na.rm = TRUE),
    numeric(1)
  )
)
# A tibble: 6 × 4
  method              engine         rows adjusted_total
  <chr>               <chr>         <int>          <dbl>
1 inverse_penetration deterministic   625       2299443.
2 selection_rate      deterministic   625       2055645.
3 selection_rate2     deterministic   625         76931.
4 raking_ratio        deterministic   625       2302435
5 coefficient         deterministic   625       2331903.
6 multilevel_bayes    frequentist     625         56235.

Method summary

As described above, each of the adjustment methods implemented via debiasR has different assumptions, input data requirements, strengths and weaknesses. The table below offers a summary which can be used as a conceptual guide helping users identify which method is more appropriate given their purpose and constraints.

Method Correction
level
How the correction
is applied
Rationale Equation Benchmark
population
Active
users
Benchmark
OD flows
Covariates Strengths Limitations
Inverse Penetration Rate Weights Area-specific coverage correction Applies origin, destination, or both-side coverage weights using weight_by. Inflates observed flows using population-to-user ratios where coverage is thin.
wi = Pi / Ui
Fadjij = Fmpdij × wi, wj, or √(wiwj)
Yes Yes No No Simple, transparent, fast, and does not require benchmark OD flows. Can overinflate sparse areas and misses compositional selection bias.
Selection Rate Area-specific selection correction Applies origin, destination, or both-side selection weights controlled by a global rt parameter. Uses coverage, covariates, and optional benchmark calibration to model differential selection.
wi(rt) = 1 / (Iici + (1 - Ii)rt)
Fadjij = Fmpdij × wij
Yes Yes Optional Yes More flexible than inverse penetration and can use benchmark calibration. Depends on covariate quality, calibration choices, and a meaningful selection parameter.
Selection Rate II Area-specific nonlinear coverage correction Applies origin, destination, or both-side coverage correction controlled by a global k parameter. Applies a compact coverage correction with optional k calibration.
CF(c; k) = c × (exp(-k) - 1) / (exp(-kc) - 1)
Fadjij = Fmpdij × CF(…)
Yes Yes Optional No Parsimonious, interpretable, and lighter than a covariate-rich adjustment. Less flexible when bias varies strongly across areas.
Raking Ratio (or Iterative proportional fitting) Margin-constrained correction Iteratively rescales OD cells so adjusted origin and/or destination totals match supplied margins. Uses iterative proportional fitting so adjusted flows match trusted origin and destination margins.
Fadjij = Fmpdij × wij
ΣjFadjij = Oi and ΣiFadjij = Dj
Optional as targets No Yes or user targets No Guarantees consistency with supplied marginal totals. Matching margins does not guarantee correct destination allocation within each origin.
Coefficient regression modelling Global coefficient correction Estimates a multiplicative coefficient from MPD-benchmark overlap and applies it to the MPD table. Learns a direct mapping from MPD flows to benchmark flows and applies it to the MPD table.
E[Fbenchij] = β Fmpdij
log(μij) = α + log(Fmpdij)
No No Yes No Direct benchmark-driven calibration with several model families. Benchmark quality is critical and a global coefficient can miss local heterogeneity.
Bayesian multilevel modelling Model-based corridor correction Predicts OD-specific adjusted flows from a true-flow formula and a coverage-offset observation model. Fits a count model where active-user coverage enters as a fixed observation-process offset, then returns the estimated true-flow scale.
log E(Fmpdij) = log(qij) + ηtrueij
ηtrueij = α + Xiβo + Xjβd + γlog(dij)
Fadjij = exp(ηtrueij)
Yes Yes No Yes Richer structure, no benchmark OD needed for fitting, and a path to uncertainty-aware partial pooling. Heavier runtime and dependencies; posterior fitting requires sampler diagnostics and careful runtime checks.