Getting started

Aim

This vignette introduces the notation and example data used throughout the debiasR vignettes. It also shows the first checks to run before measuring bias, adjusting origin-destination (OD) flows and validating results.

A typical debiasR workflow starts by:

loading the package and necessary data
identifying the population count table, OD-flow table and optional benchmark flow table
checking that the observed and benchmark tables (population counts and OD flows) cover compatible areas and contain valid population and flow values
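The area compatibility check in the last step can be sketched in base R as follows. This is a minimal illustration rather than a debiasR function: the toy tables are invented, and their column names (`origin`, `destination`, `code`) simply mirror the example data introduced later in this vignette.

```r
# Toy inputs (invented values); column names mirror the example tables
od_obs <- data.frame(
  origin      = c("A", "A", "B", "D"),
  destination = c("A", "B", "A", "A"),
  flow        = c(10, 2, 3, 1)
)
pop <- data.frame(
  code       = c("A", "B", "C"),
  population = c(1000, 2000, 1500),
  user_count = c(20, 30, 25)
)

# Area codes used anywhere in the OD table
od_codes <- union(od_obs$origin, od_obs$destination)

# Codes in the OD table with no population record, and the reverse
missing_from_pop <- setdiff(od_codes, pop$code)
missing_from_od  <- setdiff(pop$code, od_codes)

missing_from_pop
missing_from_od
```

Here area `D` appears in the flows without a population record and area `C` has a population record but no flows. Whether either case is a problem depends on the analysis, so the sketch only reports the mismatches.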

Loading the package

The README gives the GitHub installation commands for both debiasR and the companion debiasRdata package. Once those are installed, we start by loading the packages:

library(debiasR)
library(debiasRdata)

Notation

We use the same notation throughout:

$i$ : origin or area identifier
$j$ : destination identifier
$P_i$ : benchmark population for area $i$
$U_i$ : observed digital trace population for area $i$
$c_i = U_i / P_i$ : active-user coverage rate
$b_i = 1 - c_i$ : coverage bias
$\bar{c} = \sum_i U_i / \sum_i P_i$ : global active-user coverage rate
$e_i = c_i - \bar{c}$ : coverage-rate residual
$F^{trace}_{ij}$ : observed digital-trace-derived OD flow
$F^{mpd}_{ij}$ : observed mobile-phone-derived OD flow in the running example
$F^{bench}_{ij}$ : benchmark OD flow
$F^{adj}_{ij}$ : adjusted OD flow
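As a small worked example of the area-level quantities above, the following base R sketch computes $c_i$, $b_i$, $\bar{c}$ and $e_i$ for three hypothetical areas. The data frame `toy` and its values are invented for illustration and are not part of debiasRdata.

```r
# Toy area-level counts (hypothetical values)
toy <- data.frame(
  area = c("A", "B", "C"),
  P    = c(1000, 2000, 4000),  # benchmark population P_i
  U    = c(20, 30, 90)         # observed digital trace population U_i
)

toy$c_i <- toy$U / toy$P            # active-user coverage rate c_i
toy$b_i <- 1 - toy$c_i              # coverage bias b_i
c_bar   <- sum(toy$U) / sum(toy$P)  # global active-user coverage rate
toy$e_i <- toy$c_i - c_bar          # coverage-rate residual e_i

print(toy)
print(c_bar)
```

Because $\bar{c}$ is a ratio of totals, it is the population-weighted mean of the $c_i$ and generally differs from their simple average.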

Example data: Locomizer travel-to-work flows

The vignettes use an empirical example at Local Authority District (LAD) level, combining mobile-phone mobility data provided by Locomizer with benchmark data from the 2021 UK Census. Locomizer is a spatial analytics company that processes anonymised mobile phone location data to produce mobility-related insights. In this example, the observed OD-flow table represents travel-to-work flows inferred from those mobile phone records, while the benchmark OD-flow table comes from the 2021 Census travel-to-work data. The Locomizer data has been pre-processed and openly shared in an aggregated format by Zhong et al. (2025).

The first step is to identify the core inputs for the debiasR workflow, i.e. the observed OD-flow table, the benchmark OD-flow table and the area-level coverage table used to measure population coverage bias. Here we load these directly from the debiasRdata package.

OD_travel2work <- debiasRdata::lad_OD_travel2work 
census_OD_travel2work <- debiasRdata::census_lad_OD_travel2work
coverage <- debiasRdata::coverage_lad

The example data available via debiasRdata also includes supporting inputs used in later vignettes. In particular, covariates contains area-level characteristics used to explain variation in bias across places, and centroids contains LAD centroid coordinates that can be used to derive distances for flow adjustment models.

covariates <- debiasRdata::lad_covariates
centroids <- debiasRdata::lad_centroids

In summary, the main input datasets are as follows:

Object	Role
`OD_travel2work`	observed mobile-phone-derived OD flows
`census_OD_travel2work`	benchmark OD flows from the Census
`coverage`	area-level benchmark and observed digital trace population counts
`covariates`	area-level characteristics used to explain bias
`centroids`	LAD centroid coordinates that can support distance derivation

The datasets are also available at the Middle layer Super Output Area (MSOA) level via debiasRdata.

Inspect inputs

Before proceeding to the next steps of the workflow, it is useful to check and understand what each input dataset contains and whether the inputs are valid.
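One way to make "valid" concrete is to look for missing or negative flows and for repeated OD pairs. The base R sketch below does this on an invented toy table. These particular rules are an assumption about what a valid OD table looks like, not a check documented by debiasR.

```r
# Toy OD table with deliberate problems (invented values)
od <- data.frame(
  origin      = c("A", "A", "B", "B"),
  destination = c("A", "B", "A", "A"),
  flow        = c(10, NA, -3, 4)
)

# Flows should be present and non-negative
bad_flow <- is.na(od$flow) | od$flow < 0

# Each origin-destination pair should appear at most once
dup_pair <- duplicated(od[, c("origin", "destination")])

# Rows that need attention before any further analysis
od[bad_flow | dup_pair, ]
```

The same pattern applies to a population table: replace the flow column with the population columns and the OD pair with the area code.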

Below, we count the rows in each table, which gives a quick overview of the size of the main datasets. If the workflow includes $N$ areas and within-area flows are included, a complete OD-flow table has $N \times N$ rows: one row for every possible origin-destination pair. If within-area flows are excluded, the complete table has $N \times (N - 1)$ rows. If the table only records observed flows, it can have fewer rows, because OD pairs with no recorded flow may be absent. The coverage table should have one row per area, so it should contain $N$ rows.

For the supporting tables, the covariates table should also have one row per area, with columns describing area-level characteristics used to explain variation in bias. The centroids table should contain one row per area with coordinate columns that can be used to calculate distances between origins and destinations.

nrow(OD_travel2work)

[1] 42455

nrow(census_OD_travel2work)

[1] 69983

nrow(coverage)

[1] 331

nrow(covariates)

[1] 331

nrow(centroids)

[1] 363
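The printed counts can be compared with the grid sizes described earlier. The sketch below hard-codes the row counts shown above and takes $N$ to be the 331 rows of the coverage table, which is an assumption about how the area set is defined.

```r
n_areas <- 331                                  # nrow(coverage), as printed above
rows <- c(observed = 42455, benchmark = 69983)  # printed OD row counts

full_grid    <- n_areas^2                # within-area flows included
no_diag_grid <- n_areas * (n_areas - 1)  # within-area flows excluded

full_grid
no_diag_grid

# Share of the full grid present in each OD table
round(rows / full_grid, 3)
```

Both OD tables are well short of a complete grid, which is consistent with tables that only record pairs with a flow. The `head()` outputs shown later in this section include the within-area pair E06000001 to E06000001, so $N \times N$ is the relevant reference size here. Note also that centroids has 363 rows, 32 more than coverage, so the two tables are better matched on an area identifier than by row position.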

Next, we can explore the content of each table. We print the first few rows, including the column names, to show the basic structure of the core workflow tables.

Observed flows

The OD_travel2work table is the observed mobile-phone-derived OD-flow table. Each row represents a flow from a home area (in this case, a Local Authority District or LAD), stored in origin, to a workplace area, stored in destination. The flow column gives the observed number of mobile-phone-derived travel-to-work movements for that OD pair.

The table stores only the OD pairs observed in the data shipped with the companion package, so it is not a complete $N \times N$ grid. If an adjustment workflow needs a strict square OD grid, later steps can construct that grid from the observed and benchmark tables.

head(OD_travel2work)

     origin destination flow
1 E06000001   E06000001  854
2 E06000001   E06000002   33
3 E06000001   E06000003   14
4 E06000001   E06000004   95
5 E06000001   E06000005   15
6 E06000001   E06000009    1
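If a strict square grid is needed, one generic way to build it in base R is to cross all area codes and join the observed flows onto the result. The sketch below uses an invented toy table and is not necessarily how later debiasR steps construct the grid. Filling absent pairs with zero is likewise an assumption that should be checked against how the data were produced.

```r
# Toy observed OD table (invented values) with missing pairs
od <- data.frame(
  origin      = c("A", "A", "B"),
  destination = c("A", "B", "A"),
  flow        = c(10, 2, 3)
)
areas <- c("A", "B", "C")  # full set of area codes, e.g. from a population table

# Every origin-destination combination, including within-area pairs
grid <- expand.grid(origin = areas, destination = areas,
                    stringsAsFactors = FALSE)

# Left join: keep every grid row, attach the observed flow where one exists
od_square <- merge(grid, od, by = c("origin", "destination"), all.x = TRUE)

# Assumption: a pair with no recorded flow is treated as a zero flow
od_square$flow[is.na(od_square$flow)] <- 0

nrow(od_square)
od_square
```

With three areas the result has nine rows, and the total flow is unchanged by the fill.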

Benchmark flows

The census_OD_travel2work table is the benchmark OD-flow table. Each row represents a Census travel-to-work flow from a home area, stored in origin, to a workplace area, stored in destination. The flow column gives the benchmark number of people travelling from the origin to the destination.

The benchmark table uses the same origin, destination and flow schema as the observed table, which makes the two sources easy to compare or join in later steps.

head(census_OD_travel2work)

     origin destination  flow
1 E06000001   E06000001 14513
2 E06000001   E06000002  1500
3 E06000001   E06000003   671
4 E06000001   E06000004  4240
5 E06000001   E06000005   398
6 E06000001   E06000007     2
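Because both tables share the `origin`, `destination` and `flow` columns, they can be joined on the OD pair with base R's `merge()`. The sketch below copies the six rows printed for each table above. The suffixes `_mpd` and `_bench` are illustrative names, not a debiasR convention.

```r
# First six rows of each table, copied from the head() output above
obs <- data.frame(
  origin      = "E06000001",
  destination = c("E06000001", "E06000002", "E06000003",
                  "E06000004", "E06000005", "E06000009"),
  flow        = c(854, 33, 14, 95, 15, 1)
)
bench <- data.frame(
  origin      = "E06000001",
  destination = c("E06000001", "E06000002", "E06000003",
                  "E06000004", "E06000005", "E06000007"),
  flow        = c(14513, 1500, 671, 4240, 398, 2)
)

# Full join on the OD pair; suffixes label the source of each flow column
both <- merge(obs, bench, by = c("origin", "destination"),
              all = TRUE, suffixes = c("_mpd", "_bench"))
both
```

Pairs present in only one source come back with `NA` in the other flow column. Since only the first six rows of each table are used here, an `NA` in this small example does not mean the pair is absent from the full table.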

Population counts

The coverage dataset is the area-level population count table used to measure bias. Each row represents one spatial unit. The population column gives the benchmark population count for the area, taken from the 2021 UK Census. The user_count column gives the observed digital trace population count, based on the number of users with detected home locations in that area.

Home locations have been inferred from anonymised mobile-phone location records before movements are aggregated into travel-to-work OD flows. Further details on the home and work detection process are provided by Zhong et al. (2025).

head(coverage)

  date                 name      code population user_count
1 2021           Hartlepool E06000001      92338       1207
2 2021        Middlesbrough E06000002     143926       1028
3 2021 Redcar and Cleveland E06000003     136531       1144
4 2021     Stockton-on-Tees E06000004     196595       1755
5 2021           Darlington E06000005     107799       1165
6 2021               Halton E06000006     128478       1744

The key comparison is between $U_i$, the observed digital trace user count, and $P_i$, the benchmark population count. The next vignette uses these values to calculate the active-user coverage rate $c_i = U_i / P_i$ and the coverage bias $b_i = 1 - c_i$.

Zhong, Chen, Zhengzi Zhou, Nilufer Sari Aslam, Yikang Wang, and Adham Enaya. 2025. “Anonymised Human Location Data in England for Urban Mobility Research.” Scientific Data 12 (1). https://doi.org/10.1038/s41597-025-06323-8.