Aim
This vignette describes the set of adjustment methods included in debiasR, to correct mobile-phone-derived origin-destination mobility counts.
The vignette describes how each adjustment method works, associated functions and parameters, and how to implement it. All the methods in debiasR use mobile-phone-derived flows as input data and return an adjusted flow flow_adj. However, they differ in fundamental ways, in terms of data input requirements, how they derive adjustment weights and the extent they can adjust biases in the data. These differences are described in more detail below. The practical choice of which methods to use should be guided by what data are available and validation against a credible benchmark. A key innovation of debiasR is the introduction of a Bayesian multilevel modelling approach which we also explain below.
Intuition
Let us quickly recall the challenge we face when working with mobile-phone-derived mobility flows. These flows are typically represent an small proportion of the resident population, hence we end up under-counting these populations. Though we can also over-count populations if individuals have multiple mobile devices and they are recorded in the data. So the idea of the methods included in debiasR is to adjust these counts to make these counts more representative of the actual observed flow. Formally, let denote the observed mobile-phone-derived flow from origin to destination , and let denote the adjusted flow. If a method treats the observed MPD flow as under-counted, it increases . If it treats the observed MPD flow as over-counted, it reduces .
Where, let us recall:
| Symbol | Description |
|---|---|
| Observed MPD flow from origin to destination . | |
| Benchmark flow from origin to destination . | |
| Adjusted flow returned by an adjustment method. | |
| Population at origin and destination. | |
| Active-user count at origin and destination. | |
| Active-user coverage rate. | |
| Active-user coverage bias. | |
| Distance between origin and destination . | |
| Origin and destination covariates, such as rural share. |
Methods in debiasR
This vignette covers the bias adjustment methods currently available in debiasR. The table below illustrates how they map onto debiasR functions and relevant references that have inspire our thinking. Below we explain how each of these methods adjust biases and how you can implement them via debiasR.
| Method | Function | Reference |
|---|---|---|
| Inverse Penetration Rate Weights | debiasR::adjust_inverse_penetration() |
Chi et al. (2025) |
| Selection Rate | debiasR::adjust_selection_rate() |
Chi et al. (2025) |
| Selection Rate II | debiasR::adjust_selection_rate2() |
Zagheni and Weber (2012) |
| Raking Ratio (or Iterative proportional fitting) | debiasR::adjust_raking_ratio() |
Chi et al. (2025) |
| Coefficient regression modelling | debiasR::adjust_coefficient() |
Chi et al. (2025) |
| Bayesian multilevel modelling | debiasR::adjust_multilevel_bayes() |
Rowe and Cabrera (in progress); Raymer et al. (2013); Aparicio Castro et al. (2024) |
Data
We use the same empirical local authority district (LAD) travel-to-work example introduced in the earlier vignettes. The observed origin-destination flow table comes from Locomizer-derived mobile phone data and the benchmark origin-destination flow table comes from the 2021 UK Census. The data are provided through the optional debiasRdata companion package and described in more detail in the Data vignette.
The README shows how to install debiasR and the companion debiasRdata package from GitHub. We start by loading those packages.
First, we identify the main LAD data objects directly from debiasRdata.
OD_travel2work <- debiasRdata::lad_OD_travel2work
census_OD_travel2work <- debiasRdata::census_lad_OD_travel2work
coverage <- debiasRdata::coverage_lad |>
dplyr::rename(area = code)
covariates <- debiasRdata::lad_covariates
centroids <- debiasRdata::lad_centroids
head(OD_travel2work) origin destination flow
1 E06000001 E06000001 854
2 E06000001 E06000002 33
3 E06000001 E06000003 14
4 E06000001 E06000004 95
5 E06000001 E06000005 15
6 E06000001 E06000009 1
head(census_OD_travel2work) origin destination flow
1 E06000001 E06000001 14513
2 E06000001 E06000002 1500
3 E06000001 E06000003 671
4 E06000001 E06000004 4240
5 E06000001 E06000005 398
6 E06000001 E06000007 2
head(coverage) date name area population user_count
1 2021 Hartlepool E06000001 92338 1207
2 2021 Middlesbrough E06000002 143926 1028
3 2021 Redcar and Cleveland E06000003 136531 1144
4 2021 Stockton-on-Tees E06000004 196595 1755
5 2021 Darlington E06000005 107799 1165
6 2021 Halton E06000006 128478 1744
For the adjustment examples, we use debiasR::debiasR_example_data() to prepare a small, aligned version of these data. This keeps the vignette fast to render and creates the complete-grid OD structure and distance table used by some methods below.
example_data <- debiasR::debiasR_example_data(
n_areas = 25,
complete_grid = TRUE
)
mpd_od <- example_data$mpd_od
benchmark_od <- example_data$benchmark_od
coverage_adjustment <- example_data$coverage
covariates_adjustment <- example_data$covariates
distance <- example_data$distanceThe method examples below use a common comparison view so the reported flow columns have the same meaning throughout the vignette:
-
flow_mpd_raw: the raw, unadjusted MPD flow used as input. -
flow_mpd_adjusted: the adjusted MPD-derived estimate returned by the adjustment method. -
flow_benchmark: the observed Census benchmark flow, shown for interpretation and validation.
benchmark_flow_lookup <- benchmark_od |>
dplyr::select(
origin,
destination,
flow_benchmark = flow
)
show_adjustment_comparison <- function(adjusted_df,
extra_cols = character(),
n = 5) {
adjusted_df |>
dplyr::left_join(
benchmark_flow_lookup,
by = c("origin", "destination")
) |>
dplyr::select(
origin,
destination,
flow_mpd_raw = flow,
flow_mpd_adjusted = flow_adj,
flow_benchmark,
dplyr::any_of(extra_cols)
) |>
dplyr::slice_head(n = n)
}Inverse penetration rate
The inverse penetration rate adjusts origin-destination flows by scaling observed mobile-phone-derived flows according to how well the population is represented in the mobile-phone data. For each area, we calculate the active-user coverage rate () as the number of observed active users () divided by the benchmark population (). The inverse of this rate becomes the adjustment weight (). Areas with lower coverage receive larger weights because their flows are assumed to be more undercounted. For example, if only 5% of residents in an origin area appear in the mobile-phone data, flows from that origin can be scaled up by a factor of 20; that is , so each observed flow is treated as representing about twenty times as many people in the benchmark population. The adjustment can use origin coverage, destination coverage or a combination of both. This makes the method simple and transparent, but it assumes that bias is mainly proportional to population coverage and does not directly account for differences in who is represented among mobile-phone users. Formally:
Penetration rate and inverse penetration rate weight:
Origin-weighted adjustment:
Destination-weighted adjustment:
Both-side adjustment:
Implementation
To use debiasR::adjust_inverse_penetration(), we input our observed mobile-phone-derived origin-destination flows via mpd_od_df and coverage data to coverage_df in tabular format. The coverage table should contain the benchmark population and active-user counts needed to calculate how strongly each area is underrepresented. The weight_by argument lets you choose where the correction is applied: “origin” adjusts flows using coverage at the origin, “destination” uses coverage at the destination and “both” combines both sides. The result is a data frame that preserves the original observed flow in flow and adds flow_adj, the adjusted flow after inverse penetration weighting.
If we implement inverse penetration weighting adjusting by origins only, we set weight_by to “origin” as done below. The resulting table compares the original observed MPD flow, the inverse-penetration-adjusted estimate and the Census benchmark flow. The difference between the adjusted and benchmark columns gives a quick descriptive view of the adjustment, though we assess it formally in the validation vignette.
adj_inverse_penetration <- debiasR::adjust_inverse_penetration(
mpd_od_df = mpd_od,
coverage_df = coverage_adjustment,
weight_by = "origin"
)
show_adjustment_comparison(adj_inverse_penetration)# A tibble: 5 × 5
origin destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
<chr> <chr> <dbl> <dbl> <dbl>
1 E06000011 E06000011 2403 53758. 57155
2 E06000011 E06000016 1 22.4 10
3 E06000011 E06000018 1 22.4 15
4 E06000011 E06000023 2 44.7 3
5 E06000011 E06000047 6 134. 11
This method is a useful first adjustment when the main known source of bias is uneven population coverage across areas. It is especially helpful when we have reliable benchmark population counts and active-user counts, but no trusted benchmark origin-destination flow table. In that setting, inverse penetration weighting provides a transparent way to rescale observed MPD flows so that areas with lower mobile-phone coverage contribute more strongly to the adjusted flow estimates. Its main limitation is that it treats coverage as the correction mechanism. It assumes that the observed users in an area are broadly representative of the residents who are missing from the data. It therefore cannot directly account for social, demographic, behavioural or spatial differences in who appears in the MPD source. If mobile-phone users are systematically different from non-users, or if the relationship between coverage and mobility varies across places, this method may reduce undercounting while still leaving important selection bias in the adjusted flows.
Selection rate
Selection rate adjusts origin-destination flows by allowing the adjustment weight to depend on both population coverage and a covariate that helps explain who appears in the mobile-phone-derived data. The intuition is that two areas can have the same active-user coverage rate but still differ in the kinds of people represented in the MPD source. The selection-rate method therefore adds an area characteristic, such as rural share, to the coverage correction. It can also use benchmark OD flows to calibrate the selection parameter, so that the adjusted flows are closer to a trusted benchmark when one is available. This makes the method more flexible than inverse penetration weighting, but it also makes the result more dependent on the quality of the covariate and the calibration choice. Formally:
Origin-side selection weight:
Destination-side selection weight:
Adjusted flow:
where and are covariate values; is the selection-rate parameter; and, is the origin, destination or both-side correction.
Implementation
To use debiasR::adjust_selection_rate(), we again input the observed MPD origin-destination flows through mpd_od_df and the population coverage table through coverage_df. The covariates_df and covariate_col arguments tell the function which area-level characteristic to use when estimating differential selection. In this example, we use rural_pct, the rural land-share covariate from debiasRdata. While all other parameters in debiasR::adjust_selection_rate() have defaults, we recommend users to include specify benchmark_od_df for a meaningful selection-rate adjustment. This data frame corresponds to a benchmark OD table used to calibrate the selection rate parameter r in the equations above.
What about if you do not have benchmark flow?
debiasR::adjust_selection_rate()has two no-benchmark routes:
you can provide
r_globaldirectly based on prior evidence, substantive knowledge or a sensitivity-analysis design.if both
benchmark_od_dfandr_globalare omitted,debiasRuses a fallback based on the overall active-user coverage ratesum(U) / sum(P). This makes the method executable, but the result should be interpreted as an exploratory adjustment rather than a calibrated estimate.Our recommendation is to calibrate with a credible
benchmark_od_dfwhenever one is available. If no benchmark exists, we recommend treatingr_globalas a sensitivity parameter. Report the fallback value, test a small set of plausible alternatives and validate the adjusted flows against any aggregate or external evidence that is available.
To calibrate MPD flows, calibration_aggregateis set to origin. This means that MDP flows are calibrated against origin-level benchmark totals. This function tries each value in a grid of r values (r_grid), adjusts the MPD flows, sums adjusted and benchmark flows by origin, and chooses the r_t that minimises the absolute difference in origin totals. r_grid is the set of candidate values that debiasR::adjust_selection_rate() tries when it needs to calibrate the global selection parameter r_t. If calibration_aggregate is set to od, it skips that origin aggregation. Instead, for each candidate r_t, it compares each adjusted OD cell directly with the benchmark OD cell (as described in the note below), and picks the r_t with the smallest total OD-level error. The weight_by argument controls whether the correction is applied using origin coverage, destination coverage or both.
adj_selection_rate <- debiasR::adjust_selection_rate(
mpd_od_df = mpd_od,
coverage_df = coverage_adjustment,
covariates_df = covariates_adjustment,
covariate_col = "rural_pct",
weight_by = "origin",
benchmark_od_df = benchmark_od,
calibration_aggregate = "origin"
)
show_adjustment_comparison(adj_selection_rate)# A tibble: 5 × 5
origin destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
<chr> <chr> <dbl> <dbl> <dbl>
1 E06000011 E06000011 2403 54984. 57155
2 E06000011 E06000016 1 22.9 10
3 E06000011 E06000018 1 22.9 15
4 E06000011 E06000023 2 45.8 3
5 E06000011 E06000047 6 137. 11
# A tibble: 1 × 1
r_global
<dbl>
1 0.04
The comparison table reports the original MPD flow, the selection-rate adjusted estimate and the Census benchmark flow. Because this example uses weight_by = "origin", all flows leaving a given origin are adjusted using that origin’s coverage and selected covariate value. The reported r_global attribute is the calibrated selection-rate parameter chosen by the function. This method is useful when coverage alone is too simple and we believe representation varies with an area characteristic. Its main limitation is that the chosen covariate must be meaningful: if the covariate is weakly related to selection into the MPD source, the adjustment may add complexity without reducing bias. The weight_missing column in the adj_selection_rate is a diagnostic flag if missing and / or invalid origin coverage data leads to missing and / or invalid weight inputs.
Selection rate II
Selection rate II adjusts origin-destination flows by applying a compact coverage correction controlled by a single curvature parameter, . The method keeps the focus on population coverage, but instead of using the simple inverse of the coverage rate, it transforms the coverage rate through a nonlinear correction factor. This gives users a middle ground between inverse penetration weighting and the fuller covariate-based selection-rate model. When benchmark OD flows are available, debiasR can calibrate so that the adjusted flows better match a trusted comparison. Formally:
Coverage correction factor:
where is the coverage correction factor, is the active-user coverage rate for the selected origin or destination area, and is the curvature parameter that controls how strongly coverage is transformed.
Adjusted flow:
The same correction can be applied on the origin side, destination side, or both sides depending on weight_by.
Implementation
To use debiasR::adjust_selection_rate2(), the strict minimum set of parameter to be specified are MPD flows and coverage data. Input the observed MPD flows to mpd_od_df and the coverage table to coverage_df. The weight_by argument controls whether the correction uses origin coverage, destination coverage or both. You can set k directly if you have a chosen correction curvature, or leave it as NULL and provide benchmark_od_df so the function calibrates k over k_grid. As above, calibration_aggregate = "origin" calibrates against origin-level benchmark totals. The parameter k is the analog to r in debiasR::adjust_selection_rate().
adj_selection_rate2 <- debiasR::adjust_selection_rate2(
mpd_od_df = mpd_od,
coverage_df = coverage_adjustment,
weight_by = "origin",
benchmark_od_df = benchmark_od,
calibration_aggregate = "origin"
)
show_adjustment_comparison(adj_selection_rate2)# A tibble: 5 × 5
origin destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
<chr> <chr> <dbl> <dbl> <dbl>
1 E06000011 E06000011 2403 2292. 57155
2 E06000011 E06000016 1 0.954 10
3 E06000011 E06000018 1 0.954 15
4 E06000011 E06000023 2 1.91 3
5 E06000011 E06000047 6 5.72 11
# A tibble: 1 × 1
k
<dbl>
1 0.1
The comparison table reports the observed MPD count, the adjusted flow and the Census benchmark flow. In this origin-weighted example, the correction applied to a flow depends on the origin area’s coverage rate and the calibrated value. The reported k attribute shows the curvature parameter used by the adjustment. This method is useful when coverage is the dominant concern and a compact correction is adequate. Its main limitation is that one parameter may be too simple when selection bias varies strongly across places or population groups.
Raking ratio
Raking ratio is also known as iterative proportional fitting (IPF). Raking ratio adjusts origin-destination flows by iteratively rescaling the observed MPD matrix so that its origin and destination totals match trusted benchmark margins. The intuition is different from coverage weighting. Instead of calculating a weight from active-user coverage, the method asks the adjusted OD table to satisfy known aggregate constraints. If we trust total outflows from each origin and total inflows to each destination, IPF updates the OD cells until the row and column totals agree with those benchmarks. Formally:
Margin constraints:
where and are benchmark outflow and inflow totals.
Implementation
To use debiasR::adjust_raking_ratio(), the minimum set of parameter to specify is the MDP flows and set of constraints. Input the observed MPD flows to mpd_od_df and either provide benchmark_od_df or supply explicit origin_targets and destination_targets. In this vignette, we use benchmark_od_df, so debiasR derives the target origin and destination margins from the benchmark OD table.
The max_iter and tol arguments control the iterative fitting process: max_iter sets the maximum number of update cycles and tol sets the convergence tolerance.
adj_raking_ratio <- debiasR::adjust_raking_ratio(
mpd_od_df = mpd_od,
benchmark_od_df = benchmark_od
)
show_adjustment_comparison(adj_raking_ratio)# A tibble: 5 × 5
origin destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
<chr> <chr> <dbl> <dbl> <dbl>
1 E06000011 E06000011 2403 52864. 57155
2 E06000011 E06000016 1 26.3 10
3 E06000011 E06000018 1 20.4 15
4 E06000011 E06000023 2 57.2 3
5 E06000011 E06000047 6 137. 11
adj_raking_ratio |>
dplyr::summarise(
raw_total = sum(flow, na.rm = TRUE),
adjusted_total = sum(flow_adj, na.rm = TRUE),
median_weight = median(weight_ipf, na.rm = TRUE),
ipf_converged = attr(adj_raking_ratio, "ipf_converged"),
ipf_iterations = attr(adj_raking_ratio, "ipf_iterations")
)# A tibble: 1 × 5
raw_total adjusted_total median_weight ipf_converged ipf_iterations
<dbl> <dbl> <dbl> <lgl> <int>
1 80695 2302435 27.9 TRUE 172
The first table shows how individual observed MPD flows have been converted into adjusted flows and how those adjusted flows compare with the Census benchmark. The summary reports whether the IPF routine converged, how many iterations it used, and how much the total flow changed after adjustment. Raking is useful when trusted marginal totals are available and consistency with those totals is a priority. Its main limitation is that matching origin and destination margins does not guarantee that the internal allocation between specific origin-destination pairs is correct, so it should still be validated at the OD-pair level.
Coefficient regression modelling
Coefficient regression modelling adjusts origin-destination flows by learning a direct empirical relationship between observed MPD flows and benchmark flows. The method is useful when a trusted benchmark OD table is available for the same origin-destination pairs. Instead of using population coverage or margins, it fits a regression model that estimates how MPD counts map onto benchmark counts, then applies that fitted relationship to the MPD table. In its simplest form, this is a single multiplicative coefficient. In count-model versions, the relationship is estimated through a log link. Flow measured as proportions or probability as, for example, the number of outflows over the population at risk could be used as input, as well. Formally:
Core proportional form:
Count-model form:
so that:
Implementation
To use debiasR::adjust_coefficient(), the minimum input requirement are the MPD and benchmark flows. Input the observed MPD flows to mpd_od_df and a benchmark OD table to benchmark_od_df. The model_family argument controls the regression family used to learn the MPD-to-benchmark relationship; this example uses "ols" for a simple proportional calibration. The level argument controls whether the coefficient is estimated at the OD, origin or destination level. Additional options, such as fit_intercept and by_source, let users decide whether to estimate an intercept or source-specific coefficients.
adj_coefficient <- debiasR::adjust_coefficient(
mpd_od_df = mpd_od,
benchmark_od_df = benchmark_od,
model_family = "ols",
level = "od"
)
show_adjustment_comparison(adj_coefficient)# A tibble: 5 × 5
origin destination flow_mpd_raw flow_mpd_adjusted flow_benchmark
<chr> <chr> <dbl> <dbl> <dbl>
1 E06000011 E06000011 2403 69441. 57155
2 E06000011 E06000016 1 28.9 10
3 E06000011 E06000018 1 28.9 15
4 E06000011 E06000023 2 57.8 3
5 E06000011 E06000047 6 173. 11
attr(adj_coefficient, "coef")[1] 28.89774
The comparison table shows the original observed MPD flow, the regression adjusted estimate and the Census benchmark flow. The coefficient attribute reports the fitted scaling relationship used to convert MPD flows into benchmark-like flows. This method is direct and easy to validate because the benchmark is part of the model-fitting process. Its main limitation is that benchmark quality is critical, and a simple coefficient can miss local heterogeneity unless the chosen model family and estimation level capture the relevant structure.
Bayesian multilevel modelling
Bayesian multilevel modelling adjusts origin-destination flows by fitting a structured count model to the observed MPD flows and estimating the underlying true-flow scale. This vignette focuses on the recommended default implementation: the coverage-offset true-flow model. This mode treats MPD flows as coverage-scaled observations of true flows, rather than as flows that are only adjusted after fitting by switching off a bias covariate. The multilevel structure can borrow information across origins, destinations and corridors when those data structures are available. The Bayesian implementation can take time, so the vignette shows the model code but renders a precomputed example output. For analysis, run enough iterations and chains for your data and check the sampler diagnostics before interpreting the posterior summaries.
adjust_multilevel_bayes() also includes advanced Bayesian modes for repeated source/time data, backward compatibility and fast structure checking. The guide below gives the practical choice points, while the advanced Bayesian adjustment vignette keeps the deeper formulas and diagnostic detail.
A practical guide is:
| Option | Use it when | Returned scale | What it returns |
|---|---|---|---|
observation_model = "coverage_offset" |
You want the default Bayesian true-flow adjustment. | True-flow scale | MPD counts are treated as coverage-scaled observations of true flows. Start here. |
observation_model = "reduced_form" |
You need backward compatibility or an MPD-scale sensitivity check. | MPD-scale counterfactual | The model fits observed MPD flows directly and predicts a zero-bias counterfactual. |
observation_model = "latent_two_level" |
You have repeated source/time observations, especially S3 or S4 structures. | Latent true-flow scale | An experimental Stan backend estimates a shared latent OD or OD-time true-flow state. |
model_engine = "frequentist" |
You want to test S1-S4 data structure quickly before Bayesian sampling. | Fast non-posterior check | This is a fast design/checking engine, not posterior inference. |
For routine adjustment, the practical default is:
| Target | Recommended setting |
|---|---|
| Adjustment scale | target_scale = "true_flow" |
| Observation model | observation_model = "coverage_offset" |
| Coverage offset |
coverage_scale = "origin", "destination" or "both"
|
In coverage-offset true-flow mode, the model can be written as:
Level 1: MPD observation model
where is the observation probability derived from active-user coverage. With coverage_scale = "origin", . The term is an offset: it fixes the observation-process scale and is not estimated as a regression coefficient.
Level 2: true mobility-flow model
where and are origin and destination covariates, is the origin-destination distance, and , and capture origin, destination and origin-destination structure.
Intuitively, Level 1 recognises that we observe MPD flows, not the true mobility flows themselves. The observed MPD flow is treated as a coverage-scaled measurement of an underlying true flow . If an origin has 20% active-user coverage, the fitted MPD-scale expectation is the true-flow expectation multiplied by 0.20 under the origin-offset assumption. For example, if the model estimates 500 true trips from origin to destination and origin has 20% active-user coverage, the corresponding MPD observation-scale prediction is trips.
Level 2 describes the underlying mobility pattern learned from covariates, distance and origin/destination/corridor structure. It assumes that the true flow depends on origin and destination characteristics, distance, and persistent origin, destination and corridor structure. This gives the model a way to borrow information across related places instead of treating every OD corridor as a completely separate unit.
is not itself a random intercept. It is the estimated true-flow scale for a specific origin-destination pair, computed from the full true-flow linear predictor . Random intercepts, when included, are components of that predictor. An origin random intercept shifts all flows leaving origin up or down; a destination random intercept shifts flows arriving at destination ; and an OD random intercept captures a corridor-specific deviation. In the small S1 example below we set random_intercept = "none" because each corridor appears once, so is estimated from the fixed-effect predictors and the coverage offset rather than from random intercepts.
The model adjusts flows by predicting on two scales. The fitted MPD-scale prediction keeps the coverage offset:
The adjusted true-flow prediction removes the observation offset:
In a Bayesian framework, these predictions are computed draw-by-draw from the posterior and summarised by mean or median. This means flow_adj is a posterior summary of the true-flow prediction, not a single deterministic rescaling of the observed MPD count.
Implementation
To use debiasR::adjust_multilevel_bayes(), input the observed MPD flows to mpd_od_df, the population coverage table to coverage_df, area-level covariates to covariates_df and OD distances to distance_df. The compact default example below uses a single source and a single observation period. The prediction_scope argument controls whether the method adjusts only observed MPD pairs or predicts over a supplied complete OD grid.
| Input | What it supplies |
|---|---|
mpd_od_df |
Observed MPD origin-destination counts. |
coverage_df |
Active-user coverage rates used to build the observation offset. |
covariates_df |
Origin and destination predictors for the true-flow model. |
distance_df |
OD distances or distance-derived predictors. |
The equations above translate into two conceptual formula components. mobility_formula defines the true-flow model; here we use rural_pct_o, rural_pct_d and log_distance to represent origin covariates, destination covariates and distance. In the recommended default mode, the observation model is selected with target_scale = "true_flow" and observation_model = "coverage_offset", while coverage_scale selects whether the offset uses origin, destination or geometric-mean coverage. The coverage component is represented by this fixed offset rather than by an estimated coverage-bias coefficient in the fitted true-flow formula.
The default coverage-offset model is the safest Bayesian starting point because it makes the observation process explicit while keeping the fitted structure readable. adjust_multilevel_bayes() does not need benchmark OD flows to fit this model. In this vignette, benchmark flows are shown next to the raw and adjusted MPD-derived estimates so that readers can interpret and validate the adjustment.
The example below therefore uses a Bayesian fixed-effect true-flow model with an origin coverage offset. The model code is shown for reproducibility, while the rendered table is loaded from a precomputed package artifact so routine vignette updates do not rerun MCMC.
mpd_s1 <- mpd_od |>
dplyr::mutate(
mpd_source = "operator_a",
mpd_time = "2021_q1"
)
coverage_s1 <- coverage_adjustment |>
dplyr::mutate(
mpd_source = "operator_a",
mpd_time = "2021_q1"
)
adj_multilevel <- debiasR::adjust_multilevel_bayes(
mpd_od_df = mpd_s1,
coverage_df = coverage_s1,
covariates_df = covariates_adjustment,
distance_df = distance,
mobility_formula = ~ rural_pct_o + rural_pct_d + log_distance,
bias_formula = ~ bias_e_origin,
target_scale = "true_flow",
observation_model = "coverage_offset",
coverage_scale = "origin",
model_engine = "bayesian",
scenario = "s1",
source_col = "mpd_source",
time_col = "mpd_time",
repeated_observation = "none",
prediction_scope = "complete_grid",
random_intercept = "none",
model_family = "poisson",
flow_adj_summary = "median",
include_flow_adj_draws = TRUE,
iter = 1000,
chains = 2,
seed = 123,
refresh = 0
)The precomputed output below reports both posterior median and posterior mean summaries from this model fit.
| origin | destination | flow_mpd_raw | flow_mpd_adjusted_median | flow_mpd_adjusted_mean | flow_benchmark | observation_probability | flow_mpd_pred_median | flow_mpd_pred_mean |
|---|---|---|---|---|---|---|---|---|
| E06000011 | E06000011 | 2403 | 82289.89 | 82299.34 | 57155 | 0.04 | 3678.41 | 3678.83 |
| E06000011 | E06000016 | 1 | 506.20 | 506.70 | 10 | 0.04 | 22.63 | 22.65 |
| E06000011 | E06000018 | 1 | 523.64 | 524.08 | 15 | 0.04 | 23.41 | 23.43 |
| E06000011 | E06000023 | 2 | 461.64 | 462.10 | 3 | 0.04 | 20.64 | 20.66 |
| E06000011 | E06000047 | 6 | 508.61 | 508.49 | 11 | 0.04 | 22.73 | 22.73 |
The rendered table uses compact method-comparison names. The benchmark column is shown for interpretation and validation, but it is not used to fit this Bayesian model. The guide below maps the displayed columns to the names used in the object returned by adjust_multilevel_bayes().
| Rendered column | Returned object column | Meaning |
|---|---|---|
flow_mpd_raw |
flow |
Observed MPD count. |
flow_mpd_adjusted_median |
flow_adj_median |
Posterior median of the adjusted true-flow estimate. |
flow_mpd_adjusted_mean |
flow_adj_mean |
Posterior mean of the adjusted true-flow estimate. |
flow_benchmark |
not used for fitting | Benchmark OD flow shown for interpretation and validation. |
observation_probability |
observation_probability |
Fixed coverage offset on the probability scale. |
flow_mpd_pred_median |
flow_mpd_pred_median |
Posterior median of the fitted MPD observation-scale flow. |
flow_mpd_pred_mean |
flow_mpd_pred_mean |
Posterior mean of the fitted MPD observation-scale flow. |
The full returned object also includes flow_true_pred, flow_mpd_pred, and interval columns such as flow_adj_q2.5, flow_adj_q97.5, flow_mpd_pred_q2.5 and flow_mpd_pred_q97.5. The function call sets flow_adj_summary = "median", so the returned flow_adj equals flow_adj_median by default.
The saved example uses moderate sampler settings for a compact illustration. For applied analysis, run enough iterations and chains for your data and inspect the diagnostics.
Additional outputs and diagnostics
You can inspect the metadata and terms to confirm the scenario and formula fitted. For a live fit, also inspect sampler diagnostics before interpreting the posterior summaries.
| field | value |
|---|---|
| model_engine | bayesian |
| backend | rstanarm |
| target_scale | true_flow |
| observation_model | coverage_offset |
| coverage_scale | origin |
| offset_column | log_observation_probability |
| scenario | s1 |
| repeated_observation | none |
| n_sources | 1 |
| n_time_periods | 1 |
| prediction_scope | complete_grid |
| flow_adj_summary | median |
| component | value |
|---|---|
| formula | flow ~ rural_pct_o + rural_pct_d + log_distance + offset(log_observation_probability) |
| formula_source | split_formula |
| formula_interface | split_true_flow |
| mobility_formula | ~rural_pct_o + rural_pct_d + log_distance |
| bias_formula | ~bias_e_origin |
| mobility_variables | rural_pct_o, rural_pct_d, log_distance |
| bias_variables | NA |
| user_formula | TRUE |
| custom_formula | FALSE |
| default_area_covariate | rural_pct |
| default_fixed_effects | NA |
| scenario_fixed_effects | NA |
| formula_variables | flow, rural_pct_o, rural_pct_d, log_distance, log_observation_probability |
| requested_random_intercept | none |
| random_effect_term | NA |
| formula_random_effects | NA |
The metadata report the engine, target scale, observation model, coverage offset, resolved scenario, repeated-observation structure, prediction scope and model terms used to fit the adjustment model. They also confirm that this example is using the S1 contract: one MPD source and one observation period. In a live returned object, flow stores the observed MPD count, flow_mpd_pred stores the fitted MPD-scale prediction, and flow_true_pred and flow_adj store the estimated true-flow adjustment. This method is useful when users need a structured model that can combine coverage, covariates, distance and hierarchical data structure. Its main limitation is that it is more data hungry than other methods and requires significantly more runtime, dependencies and diagnostic checking when fitted with the full Bayesian engine. The key strength is that the model does not require a benchmark dataset of flows.
More complex Bayesian structures
The example above uses the simplest S1 structure: one source and one observation period. adjust_multilevel_bayes() can also accept S2-S4 source/time structures through scenario, source_col, time_col and repeated_observation.
For repeated source/time data, the latest experimental option is observation_model = "latent_two_level". In this mode, the model estimates one shared latent true-flow state for an OD pair or OD-time group and treats each source/time MPD row as an observation of that state. This is most useful for S3 and S4 structures, where multiple MPD sources observe the same underlying mobility state. Source and time effects live in the observation layer, so flow_true_pred is source-invariant within a latent state, while flow_mpd_pred can vary by source and time.
adj_latent <- adjust_multilevel_bayes(
mpd_od_df = mpd_repeated,
coverage_df = coverage_repeated,
covariates_df = covariates,
distance_df = distance,
mobility_formula = ~ rural_pct_o + rural_pct_d + log_distance,
bias_formula = ~ bias_e_origin,
target_scale = "true_flow",
observation_model = "latent_two_level",
backend = "stan_latent",
model_engine = "bayesian",
scenario = "s4",
source_col = "mpd_source",
time_col = "mpd_time",
repeated_observation = "source_time",
latent_flow_unit = "auto"
)
attr(adj_latent, "result_metadata")
attr(adj_latent, "diagnostics")backend = "auto" also selects the latent Stan backend for this observation model.
This backend is still experimental. Use it after checking that the data really contain repeated source/time observations and after inspecting the diagnostics, including divergence rate, treedepth hit rate, acceptance summaries, R-hat and effective sample size and posterior predictive metadata where available. The advanced Bayesian adjustment vignette provides deeper detail on source-time structures, reduced-form compatibility and latent-backend diagnostics.
Fit all methods
We can also run all six adjustment methods at once on the same empirical LAD subset. The helper uses the frequentist multilevel engine by default here so the method-comparison table renders quickly; switch multilevel_engine to "bayesian" when you want posterior summaries and have time to inspect sampler diagnostics.
method_results <- fit_adjustment_methods(
example_data,
include_multilevel = TRUE,
multilevel_engine = "frequentist"
)
method_engine <- function(x) {
engine <- attr(x, "model_engine", exact = TRUE)
if (is.null(engine)) "deterministic" else engine
}
tibble::tibble(
method = names(method_results),
engine = vapply(method_results, method_engine, character(1)),
rows = vapply(method_results, nrow, integer(1)),
adjusted_total = vapply(
method_results,
function(x) sum(x$flow_adj, na.rm = TRUE),
numeric(1)
)
)# A tibble: 6 × 4
method engine rows adjusted_total
<chr> <chr> <int> <dbl>
1 inverse_penetration deterministic 625 2299443.
2 selection_rate deterministic 625 2055645.
3 selection_rate2 deterministic 625 76931.
4 raking_ratio deterministic 625 2302435
5 coefficient deterministic 625 2331903.
6 multilevel_bayes frequentist 625 56235.
Method summary
As described above, each of the adjustment methods implemented via debiasR has different assumptions, input data requirements, strengths and weaknesses. The table below offers a summary which can be used as a conceptual guide helping users identify which method is more appropriate given their purpose and constraints.
| Method |
Correction level |
How the correction is applied |
Rationale | Equation |
Benchmark population |
Active users |
Benchmark OD flows |
Covariates | Strengths | Limitations |
|---|---|---|---|---|---|---|---|---|---|---|
| Inverse Penetration Rate Weights | Area-specific coverage correction |
Applies origin, destination, or both-side coverage weights using weight_by.
|
Inflates observed flows using population-to-user ratios where coverage is thin. |
wi = Pi / Ui
Fadjij = Fmpdij × wi, wj, or √(wiwj)
|
Yes | Yes | No | No | Simple, transparent, fast, and does not require benchmark OD flows. | Can overinflate sparse areas and misses compositional selection bias. |
| Selection Rate | Area-specific selection correction | Applies origin, destination, or both-side selection weights controlled by a global rt parameter. | Uses coverage, covariates, and optional benchmark calibration to model differential selection. |
wi(rt) = 1 / (Iici + (1 - Ii)rt)
Fadjij = Fmpdij × wij
|
Yes | Yes | Optional | Yes | More flexible than inverse penetration and can use benchmark calibration. | Depends on covariate quality, calibration choices, and a meaningful selection parameter. |
| Selection Rate II | Area-specific nonlinear coverage correction |
Applies origin, destination, or both-side coverage correction controlled by a global k parameter.
|
Applies a compact coverage correction with optional k calibration.
|
CF(c; k) = c × (exp(-k) - 1) / (exp(-kc) - 1)
Fadjij = Fmpdij × CF(…)
|
Yes | Yes | Optional | No | Parsimonious, interpretable, and lighter than a covariate-rich adjustment. | Less flexible when bias varies strongly across areas. |
| Raking Ratio (or Iterative proportional fitting) | Margin-constrained correction | Iteratively rescales OD cells so adjusted origin and/or destination totals match supplied margins. | Uses iterative proportional fitting so adjusted flows match trusted origin and destination margins. |
Fadjij = Fmpdij × wij
ΣjFadjij = Oi and ΣiFadjij = Dj
|
Optional as targets | No | Yes or user targets | No | Guarantees consistency with supplied marginal totals. | Matching margins does not guarantee correct destination allocation within each origin. |
| Coefficient regression modelling | Global coefficient correction | Estimates a multiplicative coefficient from MPD-benchmark overlap and applies it to the MPD table. | Learns a direct mapping from MPD flows to benchmark flows and applies it to the MPD table. |
E[Fbenchij] = β Fmpdij
log(μij) = α + log(Fmpdij)
|
No | No | Yes | No | Direct benchmark-driven calibration with several model families. | Benchmark quality is critical and a global coefficient can miss local heterogeneity. |
| Bayesian multilevel modelling | Model-based corridor correction | Predicts OD-specific adjusted flows from a true-flow formula and a coverage-offset observation model. | Fits a count model where active-user coverage enters as a fixed observation-process offset, then returns the estimated true-flow scale. |
log E(Fmpdij) = log(qij) + ηtrueij
ηtrueij = α + Xiβo + Xjβd + γlog(dij)
Fadjij = exp(ηtrueij)
|
Yes | Yes | No | Yes | Richer structure, no benchmark OD needed for fitting, and a path to uncertainty-aware partial pooling. | Heavier runtime and dependencies; posterior fitting requires sampler diagnostics and careful runtime checks. |