Aim
This vignette introduces the notation and example data used throughout the debiasR vignettes and shows the first checks needed before measuring bias, adjusting OD flows and validating results.
A typical debiasR workflow starts by:
- loading the package and necessary data
- identifying the population count table, OD-flow table and optional benchmark flow table
- checking that the observed and benchmark population and OD-flow tables use compatible areas and valid population and flow values
Loading the package
The README gives the GitHub installation commands for both debiasR and the companion debiasRdata package. Once those are installed, we start by loading the packages:
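A minimal sketch of the loading step, assuming both packages were installed under the names given in the README. The availability check is an addition for illustration only and is not part of the package workflow.

```r
# Package names as given in the text above.
pkgs <- c("debiasR", "debiasRdata")

# Illustrative guard: find any package that is not installed yet.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install first (see the README): ", paste(missing_pkgs, collapse = ", "))
} else {
  library(debiasR)
  library(debiasRdata)
}
```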
Notation
We use the same notation throughout:
- : origin or area identifier
- : destination identifier
- : benchmark population for area
- : observed digital trace population for area
- : active-user coverage rate
- : coverage bias
- : global active-user coverage rate
- : coverage-rate residual
- : observed digital-trace-derived OD flow
- : observed mobile-phone-derived OD flow in the running example
- : benchmark OD flow
- : adjusted OD flow
Example data: Locomizer travel-to-work flows
The vignettes use an empirical example at Local Authority District (LAD) level, combining mobile-phone mobility data provided by Locomizer with benchmark data from the 2021 UK Census. Locomizer is a spatial analytics company that processes anonymised mobile phone location data to produce mobility-related insights. In this example, the observed OD-flow table represents travel-to-work flows inferred from those mobile phone records, while the benchmark OD-flow table comes from the 2021 Census travel-to-work data.
The first step is to identify the core inputs for the debiasR workflow, i.e. the observed OD-flow table, the benchmark OD-flow table and the area-level coverage table used to measure population coverage bias. Here we load these directly from the debiasRdata package.
OD_travel2work <- debiasRdata::lad_OD_travel2work
census_OD_travel2work <- debiasRdata::census_lad_OD_travel2work
coverage <- debiasRdata::coverage_lad
The example data available via debiasRdata also includes supporting inputs used in later vignettes: covariates, which contains area-level characteristics used to explain variation in bias across places, and centroids, which contains LAD centroid coordinates that can be used to derive distances for flow adjustment models.
covariates <- debiasRdata::lad_covariates
centroids <- debiasRdata::lad_centroids
In summary, the main input datasets are as follows:
| Object | Role |
|---|---|
| OD_travel2work | observed mobile-phone-derived OD flows |
| census_OD_travel2work | benchmark OD flows from the Census |
| coverage | area-level benchmark and observed digital trace population counts |
| covariates | area-level characteristics used to explain bias |
| centroids | LAD centroid coordinates that can support distance derivation |
The datasets are also available at the Middle layer Super Output Area (MSOA) level via debiasRdata.
Inspect inputs
Before proceeding to the next steps of the workflow, it is useful to check and understand what each input dataset contains and whether the inputs are valid.
Below, we produce a compact row and column count, which gives a quick overview of the size of the main datasets. If the workflow covers $n$ areas and within-area flows are included, a complete OD-flow table has $n^2$ rows: one row for every possible origin-destination pair. If within-area flows are excluded, the complete table has $n(n-1)$ rows. If the table only records observed flows, it can have fewer rows, because OD pairs with no recorded flow may be absent. The coverage table should have one row per area, so it should contain $n$ rows.
For the supporting tables, the covariates table should also have one row per area, with columns describing area-level characteristics used to explain variation in bias. The centroids table should contain one row per area with coordinate columns that can be used to calculate distances between origins and destinations.
nrow(OD_travel2work)
[1] 42455
nrow(census_OD_travel2work)
[1] 69983
nrow(coverage)
[1] 331
nrow(covariates)
[1] 331
nrow(centroids)
[1] 363
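The row counts can be compared with the size of a complete OD grid. The sketch below assumes that both OD tables draw their areas from the same 331 areas as the coverage table, which the row counts alone do not confirm; `expected_od_rows()` is a hypothetical helper written for this illustration, not a debiasR function.

```r
# Hypothetical helper: rows in a complete OD grid for a given number of areas.
expected_od_rows <- function(n_areas, include_within = TRUE) {
  if (include_within) n_areas^2 else n_areas * (n_areas - 1)
}

n_areas <- 331          # rows in the coverage table (assumed to define the area set)
observed_rows <- 42455  # nrow(OD_travel2work), as printed above
benchmark_rows <- 69983 # nrow(census_OD_travel2work), as printed above

full_rows <- expected_od_rows(n_areas)

# Share of all possible OD pairs that each table actually records.
share_recorded <- c(
  observed = observed_rows / full_rows,
  benchmark = benchmark_rows / full_rows
)
share_recorded
```

Under that assumption, both tables are well short of a complete grid, which is consistent with OD pairs that have no recorded flow being absent rather than stored as zeros.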
Next, we explore the content of each table. We print the first few rows, including the column names, to show the basic structure of the core workflow tables.
Observed flows
The OD_travel2work table is the observed mobile-phone-derived OD-flow table. Each row represents a flow from a home area (in this case, a Local Authority District or LAD), stored in origin, to a workplace area, stored in destination. The flow column gives the observed number of mobile-phone-derived travel-to-work movements for that OD pair.
The table stores the observed OD pairs available in the empirical companion package. If an adjustment workflow needs a strict square OD grid, later steps can construct that grid from the observed and benchmark tables.
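One way such a square grid could be built is sketched below in base R on a toy table. The area codes are made up, and filling unrecorded pairs with zero is an assumption made for illustration; it is not necessarily how debiasR handles missing pairs.

```r
# Toy observed OD table with the same columns as OD_travel2work.
observed <- data.frame(
  origin = c("A", "A", "B"),
  destination = c("A", "B", "C"),
  flow = c(10, 3, 5),
  stringsAsFactors = FALSE
)

# Every area that appears as an origin or a destination.
areas <- sort(unique(c(observed$origin, observed$destination)))

# All origin-destination combinations, including within-area pairs.
grid <- expand.grid(origin = areas, destination = areas, stringsAsFactors = FALSE)

# Left join the observed flows onto the grid; unrecorded pairs become NA.
full_od <- merge(grid, observed, by = c("origin", "destination"), all.x = TRUE)

# Assumption: treat unrecorded pairs as zero flow.
full_od$flow[is.na(full_od$flow)] <- 0
```

With three areas, `full_od` has nine rows, and the total flow is unchanged by the padding.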
head(OD_travel2work)
     origin destination flow
1 E06000001   E06000001  854
2 E06000001   E06000002   33
3 E06000001   E06000003   14
4 E06000001   E06000004   95
5 E06000001   E06000005   15
6 E06000001   E06000009    1
Benchmark flows
The census_OD_travel2work table is the benchmark OD-flow table. Each row represents a Census travel-to-work flow from a home area, stored in origin, to a workplace area, stored in destination. The flow column gives the benchmark number of people travelling from the origin to the destination.
The benchmark table uses the same origin, destination and flow schema as the observed table, which makes the two sources easy to compare or join in later steps.
head(census_OD_travel2work)
     origin destination  flow
1 E06000001   E06000001 14513
2 E06000001   E06000002  1500
3 E06000001   E06000003   671
4 E06000001   E06000004  4240
5 E06000001   E06000005   398
6 E06000001   E06000007     2
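Because both tables share the origin, destination and flow columns, they can be joined on the OD pair. The sketch below uses a few rows copied from the two printed previews as a toy subset, so which pairs appear unmatched here says nothing about the full tables. The suffixed column names are chosen for this illustration.

```r
# Toy subsets taken from the head() output of each table.
observed <- data.frame(
  origin = "E06000001",
  destination = c("E06000001", "E06000002", "E06000009"),
  flow = c(854, 33, 1),
  stringsAsFactors = FALSE
)
benchmark <- data.frame(
  origin = "E06000001",
  destination = c("E06000001", "E06000002", "E06000007"),
  flow = c(14513, 1500, 2),
  stringsAsFactors = FALSE
)

# Full outer join on the OD pair; pairs present in only one source get NA.
od_joined <- merge(
  observed, benchmark,
  by = c("origin", "destination"),
  all = TRUE,
  suffixes = c("_observed", "_benchmark")
)
od_joined
```

A full join keeps pairs that exist in only one source, which makes gaps in either table visible before any adjustment is attempted.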
Population counts
The coverage dataset is the area-level population count table used to measure bias. Each row represents one spatial unit, identified by code. The population column gives the benchmark population count for the area, taken from the 2021 UK Census. The user_count column gives the observed digital trace population count, based on the number of users with detected home locations in that area.
Home locations have been inferred from anonymised mobile-phone location records before movements are aggregated into travel-to-work OD flows. Further details on the home and work detection process are provided by Zhong et al. (2025).
head(coverage)
  date                 name      code population user_count
1 2021           Hartlepool E06000001      92338       1207
2 2021        Middlesbrough E06000002     143926       1028
3 2021 Redcar and Cleveland E06000003     136531       1144
4 2021     Stockton-on-Tees E06000004     196595       1755
5 2021           Darlington E06000005     107799       1165
6 2021               Halton E06000006     128478       1744
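The compatibility and validity checks mentioned at the start of this vignette could be sketched roughly as follows. The toy tables reuse the column names shown above with made-up values, and the specific rules (positive populations, non-negative counts and flows, unique OD pairs) are assumptions about what "valid" means here rather than checks taken from debiasR.

```r
# Toy tables with the same column names as coverage and OD_travel2work.
coverage_toy <- data.frame(
  code = c("A", "B", "C"),
  population = c(1000, 2000, 1500),
  user_count = c(12, 30, 9),
  stringsAsFactors = FALSE
)
od_toy <- data.frame(
  origin = c("A", "A", "B"),
  destination = c("A", "B", "D"),  # "D" is deliberately missing from coverage_toy
  flow = c(10, 3, 5),
  stringsAsFactors = FALSE
)

# Areas used in the OD table that have no row in the coverage table.
od_areas <- unique(c(od_toy$origin, od_toy$destination))
unknown_areas <- setdiff(od_areas, coverage_toy$code)

checks <- c(
  one_row_per_area = anyDuplicated(coverage_toy$code) == 0,
  positive_population = all(!is.na(coverage_toy$population) & coverage_toy$population > 0),
  valid_user_count = all(!is.na(coverage_toy$user_count) & coverage_toy$user_count >= 0),
  valid_flows = all(!is.na(od_toy$flow) & od_toy$flow >= 0),
  unique_od_pairs = anyDuplicated(od_toy[c("origin", "destination")]) == 0,
  areas_match = length(unknown_areas) == 0
)
checks
unknown_areas
```

In this toy case only `areas_match` fails, and `unknown_areas` names the offending code, which is usually more useful than a bare pass or fail.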
The key comparison is between user_count, the observed digital trace user count, and population, the benchmark population count. The next vignette uses these values to calculate the active-user coverage rate and the coverage bias.
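As a rough preview of that comparison, the sketch below uses the six rows printed above and assumes that the active-user coverage rate is simply the ratio of user_count to population. That definition is an assumption made here for illustration; the next vignette gives the definitions debiasR actually uses, including coverage bias, which is not computed here.

```r
# The six rows shown in head(coverage).
coverage_head <- data.frame(
  code = c("E06000001", "E06000002", "E06000003",
           "E06000004", "E06000005", "E06000006"),
  population = c(92338, 143926, 136531, 196595, 107799, 128478),
  user_count = c(1207, 1028, 1144, 1755, 1165, 1744),
  stringsAsFactors = FALSE
)

# Assumed definition: observed users per benchmark resident.
coverage_head$coverage_rate <- coverage_head$user_count / coverage_head$population

# Areas ordered from lowest to highest assumed coverage rate.
coverage_head[order(coverage_head$coverage_rate), c("code", "coverage_rate")]
```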