Adjusting biases in origin-destination flows

Aim

This vignette describes the six adjustment methods included in debiasR to correct mobile-phone-derived origin-destination mobility counts. debiasR implements five existing methods and a novel Bayesian multilevel modelling approach. The Bayesian approach comprises a set of functional forms and model options, which we explain in this vignette and the next.

This vignette describes how each adjustment method works, its associated functions and parameters, and how to implement it. It focuses on how to use the various functions with a small sample of the dataset introduced in the Getting set up vignette. Because we use a small sample, the results in this vignette do not reflect the real performance of the methods. The next vignette, Validation, illustrates the comparative performance of the various methods. All the adjustment methods in debiasR take mobile-phone-derived aggregated flows as input and return adjusted flows in a flow_adj column.

The adjustment methods introduced here differ in fundamental ways: in their data input requirements, in how they derive adjustment weights and in the extent to which they can adjust biases in the data. These differences are described in more detail below.

While most existing flow adjustment approaches are trained on a target benchmark set of flows, the DEBIAS Bayesian approach does not impose this constraint. This is its biggest advantage, as the lack of timely and granular benchmark flow data from traditional data systems is normally the primary reason to use flows from digital sources.

Hence, methods relying on target benchmark data are of limited use in live scenarios of emergency, humanitarian or anticipatory action response. But ultimately, as discussed below, the practical choice of which methods to use should be guided by what data are available and validation against a credible benchmark.

Intuition

Let us quickly recall the challenge we face when working with mobile-phone-derived mobility flows. These flows typically capture movements associated with a small sample of the resident population, which may not be representative of the underlying population. The idea of the adjustment methods included in debiasR is to adjust the volume of the observed flows to make them more representative of the benchmark flows, which we consider as the ground truth. Formally, let $F^{mpd}_{ij}$ denote the observed mobile-phone-derived flow from origin $i$ to destination $j$, and let $F^{adj}_{ij}$ denote the adjusted flow. If the observed MPD flow is under-counted with respect to the benchmark flow, a good adjustment method would increase $F^{adj}_{ij}$ with respect to $F^{mpd}_{ij}$, to make it more similar to the benchmark flow. If the observed MPD flow is over-counted, the adjustment method would reduce $F^{adj}_{ij}$ with respect to $F^{mpd}_{ij}$.

Let us recall the notation:

Symbol	Description
$F^{mpd}_{ij}$	Observed MPD flow from origin $i$ to destination $j$ .
$F^{bench}_{ij}$	Benchmark flow from origin $i$ to destination $j$ .
$F^{adj}_{ij}$	Adjusted flow returned by an adjustment method.
$P_i, P_j$	Population at origin and destination.
$U_i, U_j$	Active-user count at origin and destination.
$c_i = U_i/P_i$	Active-user coverage rate.
$b_i = 1 - c_i$	Active-user coverage bias.
$d_{ij}$	Distance between origin $i$ and destination $j$ .
$X_i, X_j$	Origin and destination covariates, such as rural share.

Adjustment methods in `debiasR`

This vignette covers the six bias adjustment methods currently available in debiasR. The table below shows how they map onto debiasR functions and lists the references that have inspired our thinking. Below we explain how each of these methods adjusts biases and how you can implement them via debiasR.

Method	Function	Reference
Inverse Penetration Rate Weights	`adjust_inverse_penetration()`	Chi et al. (2025)
Selection Rate	`adjust_selection_rate()`	Chi et al. (2025)
Selection Rate II	`adjust_selection_rate2()`	Zagheni and Weber (2012)
Raking Ratio (or Iterative proportional fitting)	`adjust_raking_ratio()`	Chi et al. (2025)
Coefficient regression modelling	`adjust_coefficient()`	Chi et al. (2025)
Bayesian multilevel modelling	`adjust_multilevel_bayes()`	Rowe and Cabrera (in progress)

Data

We use the same empirical local authority district (LAD) travel-to-work example introduced in earlier vignettes. The observed origin-destination flow data come from Locomizer-derived mobile phone data and the benchmark origin-destination flows come from the 2021 UK Census. The data are provided through the debiasRdata companion package and described in more detail in Getting set up.

We start by loading the necessary packages:

library(debiasR)
library(debiasRdata)
library(dplyr)

We then identify the main data objects from debiasRdata.

OD_travel2work <- debiasRdata::lad_OD_travel2work
census_OD_travel2work <- debiasRdata::census_lad_OD_travel2work
coverage <- debiasRdata::coverage_lad |>
  dplyr::rename(area = code)
covariates <- debiasRdata::lad_covariates
centroids <- debiasRdata::lad_centroids

head(OD_travel2work)

     origin destination flow
1 E06000001   E06000001  854
2 E06000001   E06000002   33
3 E06000001   E06000003   14
4 E06000001   E06000004   95
5 E06000001   E06000005   15
6 E06000001   E06000009    1

head(census_OD_travel2work)

     origin destination  flow
1 E06000001   E06000001 14513
2 E06000001   E06000002  1500
3 E06000001   E06000003   671
4 E06000001   E06000004  4240
5 E06000001   E06000005   398
6 E06000001   E06000007     2

head(coverage)

  date                 name      area population user_count
1 2021           Hartlepool E06000001      92338       1207
2 2021        Middlesbrough E06000002     143926       1028
3 2021 Redcar and Cleveland E06000003     136531       1144
4 2021     Stockton-on-Tees E06000004     196595       1755
5 2021           Darlington E06000005     107799       1165
6 2021               Halton E06000006     128478       1744

For the adjustment examples, we use debiasR_example_data() to prepare a compact, aligned version of these data. It creates the complete-grid OD structure and distance table used by some adjustment methods below, which keeps the focus on how the debiasR functions work.

example_data <- debiasR_example_data(
  n_areas = 25,
  complete_grid = TRUE
)

mpd_od <- example_data$mpd_od
benchmark_od <- example_data$benchmark_od
coverage_adjustment <- example_data$coverage
covariates_adjustment <- example_data$covariates
distance <- example_data$distance

We structure the outputs in a standard format so they are easier to interpret and compare:

flow_mpd_raw: column reporting raw, unadjusted MPD flows used as input.
flow_mpd_adjusted: column reporting adjusted MPD-derived estimates returned by the adjustment method.
flow_benchmark: column reporting observed Census benchmark flows, used for validation and as input by some of the adjustment methods.

benchmark_flow_lookup <- benchmark_od |>
  dplyr::select(
    origin,
    destination,
    flow_benchmark = flow
  )

show_adjustment_comparison <- function(adjusted_df,
                                       extra_cols = character(),
                                       n = 5) {
  adjusted_df |>
    dplyr::left_join(
      benchmark_flow_lookup,
      by = c("origin", "destination")
    ) |>
    dplyr::select(
      origin,
      destination,
      flow_mpd_raw = flow,
      flow_mpd_adjusted = flow_adj,
      flow_benchmark,
      dplyr::any_of(extra_cols)
    ) |>
    dplyr::slice_head(n = n)
}

Adjustment methods

Inverse penetration rate

The inverse penetration rate adjusts origin-destination flows by scaling observed mobile-phone-derived flows according to how well the population is represented in the mobile-phone data. For each area, we calculate the active-user coverage rate ( $c_i$ ) as the number of observed active users ( $U_i$ ) divided by the benchmark population ( $P_i$ ). The inverse of this rate becomes the adjustment weight ( $w_i$ ). Areas with lower coverage receive larger weights because their flows are assumed to be more undercounted. For example, if only 5% of residents in an origin area appear in the mobile-phone data, flows from that origin can be scaled up by a factor of 20; that is $1 / 0.05 = 20$ , so each observed flow is treated as representing about twenty times as many people in the benchmark population. The adjustment can use origin coverage, destination coverage or a combination of both. This makes the method simple and transparent, but it assumes that coverage bias is mainly proportional to population coverage and does not directly account for differences in who is represented among mobile-phone users. Formally:

Penetration rate and inverse penetration rate weight:

$c_i = \frac{U_i}{P_i}, \qquad w_i = \frac{1}{c_i} = \frac{P_i}{U_i}$

Origin-weighted adjustment:

$F^{adj}_{ij} = w_i F^{mpd}_{ij}$

Destination-weighted adjustment:

$F^{adj}_{ij} = w_j F^{mpd}_{ij}$

Both-side adjustment:

$F^{adj}_{ij} = \sqrt{w_iw_j}F^{mpd}_{ij}$
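The three weighting formulas above can be sketched in a few lines of base R. This is only an illustration of the arithmetic, not the internal code of adjust_inverse_penetration(); the objects coverage_toy and od_toy and all their values are hypothetical.

```r
# Toy inputs (hypothetical values, not taken from debiasRdata)
coverage_toy <- data.frame(
  area = c("A", "B", "C"),
  population = c(1000, 2000, 4000),
  user_count = c(50, 200, 100)
)
od_toy <- data.frame(
  origin = c("A", "A", "B", "C"),
  destination = c("A", "B", "C", "A"),
  flow = c(10, 4, 6, 2)
)

# Coverage rate c = U / P and inverse penetration weight w = 1 / c
coverage_toy$c <- coverage_toy$user_count / coverage_toy$population
coverage_toy$w <- 1 / coverage_toy$c

# Look up the origin and destination weight for every OD row
w_i <- coverage_toy$w[match(od_toy$origin, coverage_toy$area)]
w_j <- coverage_toy$w[match(od_toy$destination, coverage_toy$area)]

# Origin-weighted, destination-weighted and both-side adjustments
od_toy$adj_origin <- w_i * od_toy$flow
od_toy$adj_destination <- w_j * od_toy$flow
od_toy$adj_both <- sqrt(w_i * w_j) * od_toy$flow

od_toy
```

In this toy case the three options agree for the internal flow A to A, because origin and destination share the same weight, and diverge for flows between areas with different coverage.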

Implementation

To use adjust_inverse_penetration(), we pass our observed mobile-phone-derived origin-destination flows to mpd_od_df and the coverage data to coverage_df, both in tabular format. The coverage table should contain the benchmark population and active-user counts needed to calculate how strongly each area is underrepresented. The weight_by argument lets you choose where the correction is applied: "origin" adjusts flows using coverage at the origin, "destination" uses coverage at the destination and "both" combines both sides. The result is a data frame that preserves the original observed flow in flow and adds flow_adj, the adjusted flow after inverse penetration weighting.

If we implement inverse penetration weighting adjusting by origins only, we set weight_by to “origin” as done below. The resulting table compares the original observed MPD flow, the inverse-penetration-adjusted estimate and the Census benchmark flow. The difference between the adjusted and benchmark columns gives a quick descriptive view of the adjustment, though we assess it formally in the validation vignette.

adj_inverse_penetration <- adjust_inverse_penetration(
  mpd_od_df = mpd_od,
  coverage_df = coverage_adjustment,
  weight_by = "origin"
)

show_adjustment_comparison(adj_inverse_penetration)

# A tibble: 5 × 5
  origin    destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
  <chr>     <chr>              <dbl>             <dbl>          <dbl>
1 E06000011 E06000011           2403           53758.           57155
2 E06000011 E06000016              1              22.4             10
3 E06000011 E06000018              1              22.4             15
4 E06000011 E06000023              2              44.7              3
5 E06000011 E06000047              6             134.              11

This method is a useful first adjustment when the main known source of bias is uneven population coverage across areas. It is especially helpful when we have reliable benchmark population counts, active-user counts and benchmark origin-destination flows. In that setting, inverse penetration weighting provides a transparent way to rescale observed MPD flows so that areas with lower mobile-phone coverage contribute more strongly to the adjusted flow estimates. Its main limitation is that it treats coverage as the only correction mechanism: it assumes that the observed users in an area are broadly representative of the residents who are missing from the data. As a result, it does not adjust for representativeness bias and cannot directly account for social, demographic, behavioural or spatial differences in who appears in the MPD source. If mobile-phone users are systematically different from non-users, or if the relationship between coverage and mobility varies across places, this method may reduce undercounting while still leaving important selection bias in the adjusted flows.

Selection rate

Selection rate adjusts origin-destination flows by allowing the adjustment weight to depend on both population coverage and a covariate that helps explain who appears in the mobile-phone-derived data. The intuition is that two areas can have the same active-user coverage rate but still differ in the kinds of people represented in the MPD source. The selection-rate method therefore adds an area characteristic, such as rural share, to the coverage correction. It can also use benchmark origin-destination flows to calibrate the selection parameter, so that the adjusted flows are closer to a trusted benchmark when one is available. This makes the method more flexible than inverse penetration weighting, but it also makes the result more dependent on the quality of the chosen covariate and the calibration choice. Formally:

Origin-side selection weight:

$w_i^{(O)}(r_t) = \frac{1}{I_i c_i^{(O)} + (1 - I_i) r_t}$

Destination-side selection weight:

$w_j^{(D)}(r_t) = \frac{1}{I_j c_j^{(D)} + (1 - I_j) r_t}$

Adjusted flow:

$F^{adj}_{ij} = F^{mpd}_{ij}w_{ij}$

where $I_i$ and $I_j$ are covariate values, $r_t$ is the selection-rate parameter and $w_{ij}$ is the origin, destination or both-side correction.
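To see how the covariate changes the weight, the origin-side formula can be written as a small base R function. This is a sketch under the assumption that the covariate is expressed as a share between 0 and 1; how adjust_selection_rate() scales the covariate internally is not shown here, and selection_weight() is a hypothetical helper, not a debiasR function.

```r
# Selection weight: w = 1 / (I * c + (1 - I) * r)
# I: covariate share in [0, 1] (assumed scaling)
# c: active-user coverage rate
# r: selection-rate parameter
selection_weight <- function(I, c, r) {
  stopifnot(all(I >= 0 & I <= 1), all(c > 0), r > 0)
  1 / (I * c + (1 - I) * r)
}

# Three hypothetical areas with identical coverage but different covariate values
c_i <- c(0.05, 0.05, 0.05)
I_i <- c(0, 0.5, 1)

selection_weight(I_i, c_i, r = 0.04)
```

Under this formula, an area with a covariate value of 1 receives the plain inverse penetration weight $1 / c_i$, an area with a covariate value of 0 receives $1 / r_t$, and intermediate values blend the two.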

Implementation

To use adjust_selection_rate(), we again pass the observed MPD origin-destination flows through mpd_od_df and the population coverage table through coverage_df. The covariates_df and covariate_col arguments tell the function which area-level characteristic to use when estimating differential selection. In this example, we use rural_pct, the rural land-share covariate from debiasRdata. While all other parameters in adjust_selection_rate() have defaults, we recommend that users specify benchmark_od_df for a meaningful selection-rate adjustment. This data frame is the benchmark OD table used to calibrate the selection-rate parameter $r_t$ in the equations above.

What if you do not have benchmark flows?

adjust_selection_rate() has two no-benchmark routes:

You can provide r_global directly based on prior evidence, substantive knowledge or a sensitivity-analysis design.
If both benchmark_od_df and r_global are omitted, debiasR uses a fallback based on the overall active-user coverage rate U / P. This makes the method executable, but the result should be interpreted as an exploratory adjustment rather than a calibrated estimate.

Our recommendation is to calibrate with a credible benchmark_od_df if one is available. If no benchmark exists, we recommend treating r_global as a sensitivity parameter: report the fallback value, test a small set of plausible alternatives and validate the adjusted flows against any available aggregate or external evidence.

To calibrate MPD flows, calibration_aggregate is set to "origin" in the example below. This means that MPD flows are calibrated against origin-level benchmark totals. r_grid is the set of candidate values that adjust_selection_rate() tries when it needs to calibrate the global selection parameter $r_t$. The function tries each value in r_grid, adjusts the MPD flows, sums adjusted and benchmark flows by origin, and chooses the $r_t$ that minimises the absolute difference in origin totals. If calibration_aggregate is set to "od", it skips that origin aggregation. Instead, for each candidate $r_t$, it compares each adjusted OD cell directly with the benchmark OD cell (as described in the note below), and picks the $r_t$ with the smallest total OD-level error. The weight_by argument controls whether the correction is applied using origin coverage, destination coverage or both.

adj_selection_rate <- adjust_selection_rate(
  mpd_od_df = mpd_od,
  coverage_df = coverage_adjustment,
  covariates_df = covariates_adjustment,
  covariate_col = "rural_pct",
  weight_by = "origin",
  benchmark_od_df = benchmark_od,
  calibration_aggregate = "origin"
)

show_adjustment_comparison(adj_selection_rate)

# A tibble: 5 × 5
  origin    destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
  <chr>     <chr>              <dbl>             <dbl>          <dbl>
1 E06000011 E06000011           2403           54984.           57155
2 E06000011 E06000016              1              22.9             10
3 E06000011 E06000018              1              22.9             15
4 E06000011 E06000023              2              45.8              3
5 E06000011 E06000047              6             137.              11

tibble::tibble(r_global = attr(adj_selection_rate,
                               "r_global"))

# A tibble: 1 × 1
  r_global
     <dbl>
1     0.04

The table above reports the original MPD flow, the selection-rate adjusted estimate and the Census benchmark flow. Because this example uses weight_by = "origin", all flows leaving a given origin are adjusted using that origin's coverage and selected covariate value. The reported r_global attribute is the calibrated selection-rate parameter chosen by the function. This method is useful when coverage alone is too simple and we believe that the representativeness of the data varies markedly with an area characteristic. Its main limitation is that the chosen covariate must be meaningful. If the covariate is weakly related to selection into the MPD source, the adjustment may add complexity without reducing coverage or selection bias. The weight_missing column in adj_selection_rate is a diagnostic label: it flags rows where origin coverage data are missing or invalid, and therefore where the weight inputs are missing or invalid.
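The origin-level grid calibration described in this section can be sketched in base R as follows. This is a minimal illustration of the search logic only, assuming a covariate scaled to the unit interval; calibrate_r() and the toy data frames od and area are hypothetical and do not reproduce the internals of adjust_selection_rate().

```r
# Hypothetical toy inputs: MPD flows with matching benchmark flows
od <- data.frame(
  origin = c("A", "A", "B", "B"),
  destination = c("A", "B", "A", "B"),
  flow = c(10, 2, 3, 20),
  bench = c(230, 40, 70, 450)
)
# Origin coverage rate (c) and covariate share (I), both hypothetical
area <- data.frame(area = c("A", "B"), c = c(0.05, 0.04), I = c(0.2, 0.6))

calibrate_r <- function(od, area, r_grid) {
  idx <- match(od$origin, area$area)
  bench_tot <- tapply(od$bench, od$origin, sum)
  # Loss for each candidate r: absolute gap between adjusted and benchmark origin totals
  loss <- sapply(r_grid, function(r) {
    w <- 1 / (area$I[idx] * area$c[idx] + (1 - area$I[idx]) * r)
    adj_tot <- tapply(w * od$flow, od$origin, sum)
    sum(abs(adj_tot - bench_tot[names(adj_tot)]))
  })
  r_grid[which.min(loss)]
}

r_grid <- seq(0.01, 0.10, by = 0.01)
r_hat <- calibrate_r(od, area, r_grid)
r_hat
```

Because the search is restricted to the supplied grid, the selected value can only be as fine as the grid spacing, which is one reason to inspect sensitivity to neighbouring candidates.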

Selection rate II

Selection rate II adjusts origin-destination flows by applying a compact coverage correction controlled by a single curvature parameter, $k$ . The method keeps the focus on population coverage. Instead of using the simple inverse of the coverage rate, it transforms the coverage rate through a nonlinear correction factor. This gives users a middle ground between inverse penetration weighting and the fuller covariate-based selection-rate model. When benchmark origin-destination flows are available, debiasR can calibrate $k$ so that the adjusted flows better match a trusted comparison. Formally:

Coverage correction factor:

$CF(c; k) = c\frac{\exp(-k) - 1}{\exp(-kc) - 1}$

where $CF(c; k)$ is the coverage correction factor, $c$ is the active-user coverage rate for the selected origin or destination area, and $k$ is the curvature parameter that controls how strongly coverage is transformed.

Adjusted flow:

$F^{adj}_{ij} = F^{mpd}_{ij}CF(c; k)$

The same correction can be applied by origin, destination or both, depending on weight_by.
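The correction factor is straightforward to evaluate directly. The sketch below implements the formula above in base R; coverage_correction() is a hypothetical helper, and the coverage value 0.0447 is chosen purely for illustration.

```r
# Coverage correction factor: CF(c; k) = c * (exp(-k) - 1) / (exp(-k * c) - 1)
# expm1(x) computes exp(x) - 1 and is more accurate for small x
coverage_correction <- function(c, k) {
  stopifnot(all(c > 0 & c <= 1), k != 0)
  c * expm1(-k) / expm1(-k * c)
}

# Illustrative coverage rate and curvature
cf <- coverage_correction(c = 0.0447, k = 0.1)
cf

# Adjusted flow for a hypothetical observed flow of 6
6 * cf
```

With full coverage, $c = 1$, the numerator and denominator cancel and the factor equals 1, so the flow is left unchanged.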

Implementation

adjust_selection_rate2() requires MPD flows and coverage data as inputs. Supply the observed MPD flows to mpd_od_df and the coverage data to coverage_df. The weight_by argument controls whether the correction uses origin coverage, destination coverage or both. You can set k directly if you have a chosen correction curvature, or leave it as NULL and provide benchmark_od_df so the function calibrates k over k_grid. As above, calibration_aggregate = "origin" calibrates against origin-level benchmark totals. The parameter k is the analogue of r in adjust_selection_rate().

adj_selection_rate2 <- adjust_selection_rate2(
  mpd_od_df = mpd_od,
  coverage_df = coverage_adjustment,
  weight_by = "origin",
  benchmark_od_df = benchmark_od,
  calibration_aggregate = "origin"
)

show_adjustment_comparison(adj_selection_rate2)

# A tibble: 5 × 5
  origin    destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
  <chr>     <chr>              <dbl>             <dbl>          <dbl>
1 E06000011 E06000011           2403          2292.             57155
2 E06000011 E06000016              1             0.954             10
3 E06000011 E06000018              1             0.954             15
4 E06000011 E06000023              2             1.91               3
5 E06000011 E06000047              6             5.72              11

tibble::tibble(k = attr(adj_selection_rate2, "k"))

# A tibble: 1 × 1
      k
  <dbl>
1   0.1

The output table reports the observed MPD count, adjusted flow and Census benchmark flow. In this example, the correction applied to a flow depends on the origin area's coverage rate and the calibrated $k$ value. The reported k attribute shows the curvature parameter used by the adjustment. This method is useful when coverage is the dominant concern and a compact correction is adequate. Its main limitation is that a single parameter may be too simple when selection bias varies strongly across places or population groups, particularly when no benchmark data exist to calibrate it.

Raking ratio

Raking ratio is also known as iterative proportional fitting (IPF). Raking ratio adjusts origin-destination flows by iteratively rescaling the observed MPD matrix so that its origin and destination totals match trusted benchmark margins. The intuition is different from coverage weighting. Instead of calculating a weight from active-user coverage, the method asks the adjusted origin-destination matrix to satisfy known aggregate constraints. If we trust total outflows from each origin and total inflows to each destination, IPF updates the OD cells until the row and column totals agree with those benchmarks. Formally:

Margin constraints:

$\sum_j F^{adj}_{ij} = O_i^{bench}, \qquad \sum_i F^{adj}_{ij} = D_j^{bench}$

where $O_i^{bench}$ and $D_j^{bench}$ are benchmark outflow and inflow totals.
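A minimal base R sketch of iterative proportional fitting on a small matrix is given below. It alternates row and column scaling until the row totals match their targets. The function ipf(), the seed matrix and the targets are hypothetical; adjust_raking_ratio() may differ in its handling of zeros, convergence checks and output structure.

```r
# Iterative proportional fitting on a strictly positive seed matrix
ipf <- function(seed, row_targets, col_targets, max_iter = 1000, tol = 1e-8) {
  # Targets must share the same grand total for both margins to be satisfiable
  stopifnot(isTRUE(all.equal(sum(row_targets), sum(col_targets))))
  adj <- seed
  for (iter in seq_len(max_iter)) {
    # Scale each row so that row totals match the origin targets
    adj <- adj * (row_targets / rowSums(adj))
    # Scale each column so that column totals match the destination targets
    adj <- sweep(adj, 2, col_targets / colSums(adj), `*`)
    # Columns now match exactly, so convergence is judged on the rows
    gap <- max(abs(rowSums(adj) - row_targets))
    if (gap < tol) break
  }
  list(adjusted = adj, iterations = iter, converged = gap < tol)
}

# Hypothetical observed MPD matrix (origins in rows, destinations in columns)
seed <- matrix(c(10, 2, 3, 20), nrow = 2, byrow = TRUE,
               dimnames = list(origin = c("A", "B"), destination = c("A", "B")))
res <- ipf(seed, row_targets = c(270, 520), col_targets = c(300, 490))

res$adjusted
rowSums(res$adjusted)
colSums(res$adjusted)
```

The sketch assumes that every row and column of the seed has a positive total; rows or columns that sum to zero would need separate handling.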

Implementation

adjust_raking_ratio() requires MPD flows and a set of constraints as inputs. Supply the observed MPD flows to mpd_od_df and either provide benchmark_od_df or supply explicit origin_targets and destination_targets. We use benchmark_od_df, so debiasR derives the target origin and destination margins from the benchmark OD table.

The max_iter and tol arguments control the iterative fitting process. max_iter sets the maximum number of update cycles and tol sets the convergence tolerance.

adj_raking_ratio <- adjust_raking_ratio(
  mpd_od_df = mpd_od,
  benchmark_od_df = benchmark_od
)

show_adjustment_comparison(adj_raking_ratio)

# A tibble: 5 × 5
  origin    destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
  <chr>     <chr>              <dbl>             <dbl>          <dbl>
1 E06000011 E06000011           2403           52864.           57155
2 E06000011 E06000016              1              26.3             10
3 E06000011 E06000018              1              20.4             15
4 E06000011 E06000023              2              57.2              3
5 E06000011 E06000047              6             137.              11

adj_raking_ratio |>
  dplyr::summarise(
    raw_total = sum(flow, na.rm = TRUE),
    adjusted_total = sum(flow_adj, na.rm = TRUE),
    median_weight = median(weight_ipf, na.rm = TRUE),
    ipf_converged = attr(adj_raking_ratio, "ipf_converged"),
    ipf_iterations = attr(adj_raking_ratio, "ipf_iterations")
  )

# A tibble: 1 × 5
  raw_total adjusted_total median_weight ipf_converged ipf_iterations
      <dbl>          <dbl>         <dbl> <lgl>                  <int>
1     80695        2302435          27.9 TRUE                     172

The output table shows how individual observed MPD flows have been converted into adjusted flows and how those adjusted flows compare with the Census benchmark. The summary reports whether the IPF routine converged, how many iterations it used, and how much the total flow changed after adjustment. Raking is useful when trusted marginal totals are available and consistency with those totals is a priority. Its main limitation is that it requires observed benchmark origin and destination flow data. A second limitation is that matching origin and destination margins does not guarantee that the internal allocation between specific origin-destination pairs is correct, so it should still be validated at the OD-pair level.

Coefficient regression modelling

Coefficient regression modelling adjusts origin-destination flows by learning a direct empirical relationship between observed MPD flows and benchmark flows. The method is useful when a trusted benchmark OD table is available for the same origin-destination pairs. Instead of using population coverage or margins, it fits a regression model that estimates how MPD counts map onto benchmark counts, then applies that fitted relationship to the MPD table. In its simplest form, this is a single multiplicative coefficient. In count-model versions, the relationship is estimated through a log link. Flows measured as proportions or probabilities, for example the number of outflows divided by the population at risk, could also be used as input. Formally:

Core proportional form:

$E(F^{bench}_{ij}) = \beta F^{mpd}_{ij}$

Count-model form:

$\log(\mu_{ij}) = \alpha + \log(F^{mpd}_{ij})$

so that:

$\mu_{ij} = e^{\alpha}F^{mpd}_{ij}, \qquad \beta = e^{\alpha}$
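Both forms can be fitted with base R. The sketch below uses lm() without an intercept for the proportional form and glm() with a log offset for the count-model form. The vectors mpd and bench are hypothetical, and the code illustrates the two equations rather than the internals of adjust_coefficient().

```r
# Hypothetical paired flows for the same OD pairs (MPD flows must be positive for the log offset)
mpd   <- c(10, 2, 3, 20, 5)
bench <- c(230, 40, 70, 450, 160)

# Core proportional form: least squares through the origin, E(bench) = beta * mpd
fit_ols <- lm(bench ~ 0 + mpd)
beta_ols <- unname(coef(fit_ols))

# Count-model form: Poisson regression with log(mpd) as an offset,
# log(mu) = alpha + log(mpd), so that beta = exp(alpha)
fit_pois <- glm(bench ~ 1, offset = log(mpd), family = poisson())
beta_pois <- unname(exp(coef(fit_pois)))

c(ols = beta_ols, poisson = beta_pois)

# Adjusted flows under each fitted coefficient
data.frame(mpd, bench, adj_ols = beta_ols * mpd, adj_pois = beta_pois * mpd)
```

The two coefficients generally differ: least squares through the origin gives more influence to large flows, whereas the intercept-only Poisson fit with an offset reproduces the ratio of total benchmark flow to total MPD flow.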

Implementation

To use adjust_coefficient(), the essential input requirements are the MPD and benchmark flows. Input the observed MPD flows to mpd_od_df and benchmark OD flows to benchmark_od_df. The model_family argument controls the regression family used to learn the MPD-to-benchmark relationship. This example uses "ols" for a simple proportional calibration. The level argument controls whether the coefficient is estimated at the OD, origin or destination level. Additional options, such as fit_intercept and by_source, let users decide whether to estimate an intercept or source-specific coefficients.

adj_coefficient <- adjust_coefficient(
  mpd_od_df = mpd_od,
  benchmark_od_df = benchmark_od,
  model_family = "ols",
  level = "od"
)

show_adjustment_comparison(adj_coefficient)

# A tibble: 5 × 5
  origin    destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
  <chr>     <chr>              <dbl>             <dbl>          <dbl>
1 E06000011 E06000011           2403           69441.           57155
2 E06000011 E06000016              1              28.9             10
3 E06000011 E06000018              1              28.9             15
4 E06000011 E06000023              2              57.8              3
5 E06000011 E06000047              6             173.              11

attr(adj_coefficient, "coef")

[1] 28.89774

The output table shows the original observed MPD flow, regression adjusted estimate and Census benchmark flow. The coefficient attribute reports the fitted scaling relationship used to convert MPD flows into benchmark-like flows. This method is easy to validate because the benchmark is part of the model-fitting process. Its main limitation is that it requires benchmark flow data, and it does not adjust for representativeness bias. A simple coefficient can miss local heterogeneity unless the chosen model family and estimation level capture the relevant structure.

Bayesian multilevel modelling

Bayesian multilevel modelling in debiasR is a flexible modelling framework. In simple terms, the model addresses the following question: given the observed MPD flow and the level of population coverage, what underlying flow is likely to have produced that observation? If sufficient information is available about the characteristics of the locations, the model can also use patterns shared across origins, destinations and OD pairs to improve the estimates.

Important! The Bayesian models are designed for data-scarce settings where benchmark OD flows are unavailable for adjustment, as often occurs during natural hazards, armed conflict or health emergencies. Unlike all the other adjustment methods introduced in this vignette, the Bayesian method can estimate adjusted flows without benchmark flow data and quantify the uncertainty of those estimates. If benchmark flows are available, they can still be used for external validation.

Therefore, Bayesian models should not be assessed only on whether they achieve lower error than benchmark-calibrated methods, but also on the uncertainty estimates they provide and their flexibility across data sources and time periods.

The models are implemented via the function adjust_multilevel_bayes() which enables three model variants and four input data scenarios (S1-S4). Detailed information about model variants and input data scenarios is provided in the advanced Bayesian adjustment vignette. The function also supports model_engine = "frequentist", which can help users explore model specifications, covariates, random effects and differences across data sources or time periods before running the more computationally demanding Bayesian models. A practical guide to the adjust_multilevel_bayes() variants is provided in the table below:

Model variant	Use it when	Intuition
`observation_model= "coverage_offset"`	Estimate real population flows from phone counts, accounting for how well each area is represented in the mobile-phone data.	The model estimates true flows while treating source/time coverage as a fixed observation offset.
`observation_model= "reduced_form"`	Adjust the observed phone counts without using coverage rates to estimate how many people those phone counts represent.	The model fits observed MPD flows directly and predicts a counterfactual MPD flow with the coverage bias term set to zero.
`observation_model= "latent_two_level"`	Estimate shared population flows when the same OD pairs are observed more than once, such as across different data sources or time periods.	The model estimates source-invariant latent OD or OD-time true-flow states and treats observed source/time rows as noisy measurements.
`model_engine= "frequentist"`	Run the frequentist option to check that the data, formulas and assumptions work before running the full Bayesian model.	This option uses the same non-latent input structure for rapid specification checks and method comparison.

In this section, we focus on illustrating the default model with variant coverage-offset and the input data scenario S1, which relies on a single source of MPD data and a single time period. The coverage-offset model has two components: an MPD observation model and a true-flow model. The observation model describes how an underlying population flow becomes partially visible in the MPD data. The true-flow model estimates the population mobility pattern behind the observed MPD counts.

Component 1: MPD observation model. The MPD observation model describes how an underlying population flow becomes partially visible in the MPD data:

$F^{mpd}_{ij} \sim \operatorname{Poisson} \left(q_{ij}\lambda^{true}_{ij}\right)$

Here, $F^{mpd}_{ij}$ is the observed MPD flow from origin $i$ to destination $j$ . The term $\lambda^{true}_{ij}$ is the expected underlying population flow for the same OD pair. We use $\lambda^{true}_{ij}$ rather than $F^{true}_{ij}$ because this is the expected rather than observed true flow, which may not be a whole number. The term $q_{ij}$ is the proportion of the underlying flow expected to be visible in the MPD data. Here, we use the Poisson distribution, as it allows the observed count to vary around its expected value rather than assuming that it must equal that value exactly. If the data show greater variation or more zero flows, adjust_multilevel_bayes() also supports negative binomial, zero-inflated Poisson and zero-inflated negative binomial models. The expected observed MPD flow is therefore: $E[F^{mpd}_{ij}] = q_{ij}\lambda^{true}_{ij}$ , or taking logarithms for a linearised form:

$\log( E\left[F^{mpd}_{ij}\right]) = \log(q_{ij})+\eta^{true}_{ij}$

where $\lambda^{true}_{ij} = \exp(\eta^{true}_{ij})$ . The coverage rate $q_{ij}$ is supplied as known information about the observation process. It tells the model what proportion of the underlying flow the MPD system is expected to observe. Then, for example, if the expected underlying population flow is 500 and $q_{ij}=0.20$ , the model expects to observe approximately $0.20\times500=100$ movements in the MPD data. When fitting the model, the reasoning works in the opposite direction, as the observed MPD count and the coverage rate are used to estimate the underlying true population flow that could plausibly have produced that observation.

In adjust_multilevel_bayes(), the coverage_scale argument determines how $q_{ij}$ is calculated. If coverage_scale = "origin", then $q_{ij}=c_i=U_i/P_i$ , where $U_i$ is the observed active-user count and $P_i$ is the benchmark population in the origin area. In this case, all flows leaving the same origin use that origin’s coverage rate. Other options are coverage_scale = "destination" ( $q_{ij} = c_j$ ) and coverage_scale = "both" ( $q_{ij} = \sqrt{c_i c_j}$ ).
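To make the three coverage_scale options concrete, the sketch below computes $q_{ij}$ by hand for three made-up areas. The active-user counts, populations, area labels and column names are all hypothetical, and the code only mirrors the formulas given above; it does not call debiasR or reproduce its internals.

```r
# Hypothetical active-user counts (U) and benchmark populations (P)
areas <- data.frame(
  area = c("A", "B", "C"),
  users = c(4000, 1500, 9000),
  population = c(100000, 50000, 150000)
)
areas$coverage <- areas$users / areas$population   # c = U / P

# All OD pairs between the three areas
od <- expand.grid(
  origin = areas$area,
  destination = areas$area,
  stringsAsFactors = FALSE
)
c_i <- areas$coverage[match(od$origin, areas$area)]
c_j <- areas$coverage[match(od$destination, areas$area)]

od$q_origin <- c_i                 # formula for coverage_scale = "origin"
od$q_destination <- c_j            # formula for coverage_scale = "destination"
od$q_both <- sqrt(c_i * c_j)       # formula for coverage_scale = "both"

# The offset enters the model on the log scale as log(q_ij)
od$log_q_origin <- log(od$q_origin)

print(od)
```

Under the "both" option, $q_{ij}$ is the geometric mean of the two coverage rates, so it always lies between the origin and destination values.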

Component 2: True-flow model. The second component estimates the expected underlying population flow for each OD pair. We model $\eta^{true}_{ij}$ , i.e. the log of the expected flow, as

$\eta^{true}_{ij} = \alpha + \beta_oX_i + \beta_dX_j + \gamma\log d_{ij} + u_i + v_j + w_{ij} + \xi_{ij}$

where $\alpha$ represents the overall level of flow, the terms $X_i$ and $X_j$ represent characteristics of the origin and destination and $d_{ij}$ is the distance between them. The terms $u_i$ , $v_j$ and $w_{ij}$ give the model a multilevel structure and allow the model to represent patterns shared by flows from the same origin, to the same destination or repeated observations of the same OD pair. When included, flows from related observations can help inform one another. For example, if only a few movements are observed for one OD pair, the model can also consider the general pattern among flows from the same origin or to the same destination. The estimate for that OD pair is therefore based partly on its own data and partly on patterns found in related flows. This can lead to more stable estimates when some observed flows are small or highly variable. After combining all the terms in the above equation, an exponential transformation can be applied to obtain a positive expected population flow, $\lambda^{true}_{ij}$ .
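As an illustration of how the terms combine, the following sketch evaluates the linear predictor for one OD pair and exponentiates it. Every coefficient and covariate value here is invented for the example, the function name true_flow() is hypothetical, and the pair-level terms $w_{ij}$ and $\xi_{ij}$ are set to zero for simplicity.

```r
# Hypothetical coefficient values, chosen only for illustration
alpha  <- 8       # overall level of flow (log scale)
beta_o <- -0.01   # effect of the origin covariate
beta_d <- -0.02   # effect of the destination covariate
gamma  <- -1.5    # coefficient on log distance
u_i    <- 0.10    # origin-level effect
v_j    <- -0.05   # destination-level effect

# Expected true flow for given covariates and distance
# (pair-level terms are omitted, i.e. treated as zero)
true_flow <- function(x_o, x_d, distance) {
  eta <- alpha + beta_o * x_o + beta_d * x_d + gamma * log(distance) + u_i + v_j
  exp(eta)  # exponentiate to obtain a positive expected flow
}

lambda_near <- true_flow(x_o = 20, x_d = 10, distance = 25)
lambda_far  <- true_flow(x_o = 20, x_d = 10, distance = 50)

print(c(near = lambda_near, far = lambda_far))

# Doubling the distance multiplies the expected flow by 2^gamma
print(lambda_far / lambda_near)
```

Because distance enters in logs, $\gamma$ acts as an elasticity: with a negative $\gamma$, the expected flow falls by a constant proportion each time distance doubles.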

How are the unknown quantities in Components 1 and 2 estimated? Following a Bayesian approach, adjust_multilevel_bayes() assigns prior distributions to the model parameters, including the regression coefficients and any multilevel effects present in the model. These priors describe plausible parameter values before the observed MPD flows are considered. The function then combines the prior distributions with the observed MPD flows through the observation model to generate posterior distributions of the parameter values.

$\text{prior distributions} + \text{observed MPD flows} \longrightarrow \text{posterior distributions}$
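The prior-to-posterior step can be illustrated with a deliberately simplified case: a single OD pair, a known coverage rate and a Gamma prior placed directly on the expected true flow. This is not the model fitted by adjust_multilevel_bayes(), which places priors on regression parameters and relies on posterior sampling; it is a conjugate Gamma-Poisson sketch with hypothetical prior values, chosen because the posterior has a closed form.

```r
# Single OD pair, conjugate Gamma-Poisson simplification (not the debiasR model)
y_mpd <- 100    # observed MPD count
q_ij  <- 0.20   # known coverage rate

# Hypothetical prior on the expected true flow:
# Gamma(shape, rate), with prior mean = shape / rate = 200
prior_shape <- 2
prior_rate  <- 0.01

# With y ~ Poisson(q * lambda) and lambda ~ Gamma(shape, rate),
# the posterior is Gamma(shape + y, rate + q)
post_shape <- prior_shape + y_mpd
post_rate  <- prior_rate + q_ij

post_mean <- post_shape / post_rate
post_interval <- qgamma(c(0.025, 0.975), shape = post_shape, rate = post_rate)

print(post_mean)
print(post_interval)
```

The posterior mean sits close to the naive estimate of 100 / 0.20 = 500 and is pulled slightly towards the prior mean of 200, which shows how the prior and the observed count are combined.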

To generate a posterior distribution of the adjusted flow, adjust_multilevel_bayes() samples many sets of parameter values from their posterior distributions. For each sample, known as a posterior draw and indexed by $s$ , the corresponding expected underlying flow is calculated as $F^{adj,(s)}_{ij} = \lambda^{true,(s)}_{ij} = \exp(\eta^{true,(s)}_{ij})$ . For example, different posterior draws might produce adjusted flows of 482, 510, 495 and 536 for the same OD pair. These represent different underlying flows that remain plausible given the observed MPD data, coverage rates, model specification and prior assumptions. The collection of draws forms the posterior distribution of the adjusted flow.
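A collection of draws is usually reduced to a few summaries. The sketch below does this for the four illustrative draws quoted above, using base R only. A real fit would produce many more draws, and the variable names are hypothetical.

```r
# Adjusted-flow draws quoted in the text for a single OD pair
draws <- c(482, 510, 495, 536)

draw_median <- median(draws)   # 502.5
draw_mean   <- mean(draws)     # 505.75

# Lower and upper posterior quantiles; with only four draws these are
# very rough and are shown purely to illustrate the calculation
draw_interval <- quantile(draws, probs = c(0.025, 0.975))

print(draw_median)
print(draw_mean)
print(draw_interval)
```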

Implementation

The code below shows how to fit the Bayesian coverage-offset model and inspect its main outputs. When fitting the Bayesian models, it is important to use enough iterations and chains for the application and check the sampler diagnostics before interpreting the results. To use adjust_multilevel_bayes(), input the observed MPD flows to mpd_od_df, population coverage to coverage_df, area-level covariates to covariates_df and OD distances to distance_df. The example below uses the S1 scenario, with a single data source and a single observation period.

For the coverage-offset model variant, most users only need a small set of arguments. The mobility_formula defines the true-flow model described above. Here we use rural_pct_o, rural_pct_d and log_distance to represent origin covariates, destination covariates and distance. This example uses a single fixed intercept, but the model can accommodate multilevel terms such as random intercepts by origin (1 | origin) or by destination (1 | destination). The recommended settings are target_scale = "true_flow" and observation_model = "coverage_offset". The wider parameter set is covered in the advanced Bayesian vignette.

Argument	Description
`mobility_formula`	Defines the true-flow predictor, such as origin covariates, destination covariates, distance and optional multilevel terms.
`target_scale = "true_flow"`	Returns adjusted flows on the estimated true-flow scale rather than on the MPD observation scale.
`observation_model = "coverage_offset"`	Treats MPD flows as coverage-scaled observations of true flows.
`coverage_scale`	Chooses whether the fixed coverage offset uses origin coverage, destination coverage or both.
`model_engine = "bayesian"`	Fits the posterior model. Use the frequentist option, `model_engine = "frequentist"`, first when you want a specification check.
`scenario`, `source_col`, `time_col`	Declare the source/time data structure. The example below uses S1: one source and one time period.
`prediction_scope`	Chooses whether predictions are returned only for observed MPD rows or for a supplied complete OD matrix.
`random_intercept`	Adds or omits origin, destination, OD, source or time pooling structure when the data support it.
`model_family`	Chooses the count likelihood family, such as Poisson or negative binomial.
`iter`, `chains`, `seed`	Control Bayesian sampling and reproducibility.

The example below uses a Bayesian fixed-effect true-flow model with an origin coverage offset.

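# Label the single MPD source and time period used in scenario S1
mpd_s1 <- mpd_od |>
  dplyr::mutate(
    mpd_source = "operator_a",
    mpd_time = "2021_q1"
  )

# Add the same source and time labels to the coverage table
coverage_s1 <- coverage_adjustment |>
  dplyr::mutate(
    mpd_source = "operator_a",
    mpd_time = "2021_q1"
  )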

adj_multilevel <- adjust_multilevel_bayes(
  mpd_od_df = mpd_s1,
  coverage_df = coverage_s1,
  covariates_df = covariates_adjustment,
  distance_df = distance,
  mobility_formula = ~ rural_pct_o + rural_pct_d + log_distance,
  bias_formula = ~ bias_e_origin,
  target_scale = "true_flow",
  observation_model = "coverage_offset",
  coverage_scale = "origin",
  model_engine = "bayesian",
  scenario = "s1",
  source_col = "mpd_source",
  time_col = "mpd_time",
  repeated_observation = "none",
  prediction_scope = "complete_grid",
  random_intercept = "none",
  model_family = "poisson",
  flow_adj_summary = "median",
  include_flow_adj_draws = TRUE,
  iter = 1000,
  chains = 2,
  seed = 123,
  refresh = 0
)

origin	destination	flow	flow_adj	flow_adj_median	flow_adj_mean	flow_mpd_pred	observation_probability	flow_benchmark
E06000011	E06000011	2403	82289.89	82289.89	82299.34	3678.41	0.04	57155
E06000011	E06000016	1	506.20	506.20	506.70	22.63	0.04	10
E06000011	E06000018	1	523.64	523.64	524.08	23.41	0.04	15
E06000011	E06000023	2	461.64	461.64	462.10	20.64	0.04	3
E06000011	E06000047	6	508.61	508.61	508.49	22.73	0.04	11

The table above reports the main outputs from adjust_multilevel_bayes(), with flow_benchmark added to show the benchmark flows for comparison. The column flow is the observed MPD count and flow_adj is the main adjusted flow estimate. The function call sets flow_adj_summary = "median", so flow_adj equals flow_adj_median in this example; it is also possible to set flow_adj_summary = "mean". The columns flow_adj_median and flow_adj_mean report two posterior summaries of the estimated population flow. The column flow_mpd_pred reports the fitted value on the MPD-observed scale, after applying the coverage offset. The full output also includes flow_true_pred, which in this example is equal to flow_adj, so the displayed table keeps flow_adj as the single population-scale adjusted estimate. Furthermore, the posterior median and mean of the predicted MPD flow are reported under flow_mpd_pred_median and flow_mpd_pred_mean, and the 2.5% and 97.5% posterior quantiles of the adjusted and predicted MPD flows are reported under flow_adj_q2.5, flow_adj_q97.5, flow_mpd_pred_q2.5 and flow_mpd_pred_q97.5. The example uses moderate sampler settings for illustrative purposes. For applied analysis, it is recommended to increase the number of iterations and chains.
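One quick sanity check on these outputs is to compare the two prediction scales. The sketch below re-enters the five rows printed above into a small data frame and computes flow_mpd_pred / flow_adj. It assumes that flow_mpd_pred is the adjusted flow scaled by the coverage rate, which is how the coverage offset is described in this vignette. The data frame and the implied_coverage column are constructed here for illustration and are not produced by debiasR.

```r
# Values copied from the output table above (all rows share one origin)
out <- data.frame(
  destination = c("E06000011", "E06000016", "E06000018", "E06000023", "E06000047"),
  flow_adj = c(82289.89, 506.20, 523.64, 461.64, 508.61),
  flow_mpd_pred = c(3678.41, 22.63, 23.41, 20.64, 22.73)
)

# Share of the adjusted flow that the model expects to see in the MPD data
out$implied_coverage <- out$flow_mpd_pred / out$flow_adj

print(out)
print(range(out$implied_coverage))
```

The ratio is essentially constant across the five rows and rounds to the observation_probability of 0.04 shown in the table, as expected when coverage_scale = "origin" and all rows share the same origin.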

You can inspect the results metadata to confirm the scenario, model variant, coverage scale and other attributes of the model setup.

attr(adj_multilevel, "result_metadata")

field	value
model_engine	bayesian
backend	rstanarm
target_scale	true_flow
observation_model	coverage_offset
coverage_scale	origin
offset_column	log_observation_probability
scenario	s1
repeated_observation	none
n_sources	1
n_time_periods	1
prediction_scope	complete_grid
flow_adj_summary	median

It is also possible to inspect the model terms, which include the specific formula used to fit the model.

attr(adj_multilevel, "model_terms")

component	value
formula	flow ~ rural_pct_o + rural_pct_d + log_distance + offset(log_observation_probability)
formula_source	split_formula
formula_interface	split_true_flow
mobility_formula	~rural_pct_o + rural_pct_d + log_distance
bias_formula	~bias_e_origin
mobility_variables	rural_pct_o, rural_pct_d, log_distance
bias_variables	NA
user_formula	TRUE
custom_formula	FALSE
default_area_covariate	rural_pct
default_fixed_effects	NA
scenario_fixed_effects	NA
formula_variables	flow, rural_pct_o, rural_pct_d, log_distance, log_observation_probability
requested_random_intercept	none
random_effect_term	NA
formula_random_effects	NA

Finally, inspecting sampler diagnostics can be helpful before interpreting the posterior summaries.

attr(adj_multilevel, "diagnostics")

Fit all methods

We can also run all six adjustment methods at once using adjust_all_methods(). By default, the function uses the frequentist engine for the multilevel method, so the main adjustment methods can be compared without posterior sampling. You can set multilevel_engine to "bayesian" for more complex structures.

method_results <- adjust_all_methods(
  mpd_od_df = mpd_od,
  coverage_df = coverage_adjustment,
  benchmark_od_df = benchmark_od,
  covariates_df = covariates_adjustment,
  distance_df = distance,
  covariate_col = "rural_pct",
  multilevel_engine = "frequentist"
)

# A tibble: 6 × 4
  method              engine         rows adjusted_total
  <chr>               <chr>         <int>          <dbl>
1 inverse_penetration deterministic   625       2299443.
2 selection_rate      deterministic   625       2055645.
3 selection_rate2     deterministic   625         76931.
4 raking_ratio        deterministic   625       2302435
5 coefficient         deterministic   625       2331903.
6 multilevel_bayes    frequentist     625         56235.

Method summary

As described above, each adjustment method in debiasR has different assumptions, data needs, strengths and limitations. The table below focuses on the main differences readers need when choosing a method: whether the method uses benchmark information during fitting, why the method is useful, and where it is most likely to fall short.

Method	Benchmark role	Rationale	Strengths	Limitations
Inverse Penetration Rate Weights	Not benchmark-trained. Uses benchmark population and active-user counts, but not benchmark OD flows.	Inflates observed flows using population-to-user ratios where coverage is thin.	Simple, transparent, fast, and does not require benchmark OD flows.	Can overinflate sparse areas and misses compositional selection bias.
Selection Rate	Optionally benchmark-assisted. Can calibrate the selection parameter against benchmark OD flows; otherwise uses coverage and covariates.	Uses coverage, covariates, and optional benchmark calibration to model differential selection.	More flexible than inverse penetration and can use benchmark calibration.	Depends on covariate quality, calibration choices, and a meaningful selection parameter.
Selection Rate II	Optionally benchmark-assisted. Can calibrate the global `k` parameter against benchmark OD flows; otherwise uses a supplied or default `k`.	Applies a compact coverage correction with optional `k` calibration.	Parsimonious, interpretable, and lighter than a covariate-rich adjustment.	Less flexible when bias varies strongly across areas.
Raking Ratio (or Iterative proportional fitting)	Benchmark-margin assisted. Uses benchmark origin and destination margins, or user-supplied targets, rather than fitting to each OD cell.	Uses iterative proportional fitting so adjusted flows match trusted origin and destination margins.	Guarantees consistency with supplied marginal totals.	Matching margins does not guarantee correct destination allocation within each origin.
Coefficient regression modelling	Benchmark-trained. Fits the MPD-to-benchmark relationship using benchmark OD cells.	Learns a direct mapping from MPD flows to benchmark flows and applies it to the MPD table.	Direct benchmark-driven calibration with several model families.	Benchmark quality is critical and a global coefficient can miss local heterogeneity.
Bayesian multilevel modelling	Not benchmark-trained. Uses coverage and covariates during fitting; benchmark OD flows are reserved for external validation.	Fits a count model where active-user coverage enters as a fixed observation-process offset, then returns the estimated true-flow scale.	Richer structure, no benchmark OD needed for fitting, posterior uncertainty, and a path to source/time-aware partial pooling.	Heavier runtime and dependencies; posterior fitting requires sampler diagnostics and should be compared externally against benchmark-calibrated methods.
