Skip to contents

Aim

The previous vignette measured population coverage bias and showed that some LADs are covered better than others. Now, we ask the following question: are those differences in coverage related to the characteristics of the local populations?

The aim is to show how coverage bias can be explored and explained. The research paper associated with the DEBIAS methodology uses more flexible machine-learning methods for analysis. Here, we use a simpler regression model because it is easier to understand and a useful first step for package users.

Load bias and covariates

In this vignette, we illustrate how debiasR can be used with example UK data at the local authority district (LAD) level. We start with two tables from debiasRdata: a coverage table and a covariates table. The coverage table contains the benchmark Census population and the observed digital trace user count for each LAD. The covariate table contains selected area-level characteristics from the 2021 Census.

The coverage table stores the LAD identifier in a column called code. We rename it to area so that it uses the same identifier name as the covariate table and the rest of the debiasR workflow.

coverage <- debiasRdata::coverage_lad |>
  rename(area = code)

covariates <- debiasRdata::lad_covariates

head(coverage)
  date                 name      area population user_count
1 2021           Hartlepool E06000001      92338       1207
2 2021        Middlesbrough E06000002     143926       1028
3 2021 Redcar and Cleveland E06000003     136531       1144
4 2021     Stockton-on-Tees E06000004     196595       1755
5 2021           Darlington E06000005     107799       1165
6 2021               Halton E06000006     128478       1744

The covariates included in the data describe some demographic, socioeconomic and geographic characteristics of each LAD: percentage of UK-born residents, percentage of residents aged 20-29, percentage of residents aged 70 or over, percentage of residents with Level 4 qualifications, percentage of households without central heating, percentage of residents in routine occupations and percentage of rural land.

head(covariates)
       area                 name year per_ukborn per_age_20.29 per_age_70plus
1 E06000001           Hartlepool 2021   96.03505      11.48510       14.06370
2 E06000002        Middlesbrough 2021   87.71191      14.23857       11.79217
3 E06000003 Redcar and Cleveland 2021   97.08560      10.28030       17.02885
4 E06000004     Stockton-on-Tees 2021   93.76027      10.99062       13.42404
5 E06000005           Darlington 2021   92.18360      10.95073       14.91823
6 E06000006               Halton 2021   95.23031      11.41587       13.00290
  per_level4 per_hh_no_centralheat per_NS_SeC_L13_routine  rural_pct
1   24.80518             0.8355729               15.71647  3.0949158
2   26.44134             1.2478842               15.06082  0.2195405
3   24.92656             0.9231318               14.94379 31.6923280
4   29.52218             0.8417207               14.10037  5.5527863
5   28.96138             0.9526146               15.20799 13.0556973
6   23.93120             1.4351844               16.35048  0.7223308

Next, we calculate the active-user coverage rate and coverage bias.

bias_lad <- debiasR::measure_bias(coverage)

bias_lad |>
  select(area, name, population, user_count, coverage_score, coverage_bias) |>
  head()
       area                 name population user_count coverage_score
1 E06000001           Hartlepool      92338       1207    0.013071542
2 E06000002        Middlesbrough     143926       1028    0.007142559
3 E06000003 Redcar and Cleveland     136531       1144    0.008379049
4 E06000004     Stockton-on-Tees     196595       1755    0.008926982
5 E06000005           Darlington     107799       1165    0.010807150
6 E06000006               Halton     128478       1744    0.013574308
  coverage_bias
1     0.9869285
2     0.9928574
3     0.9916210
4     0.9910730
5     0.9891928
6     0.9864257

The previous vignette introduced the idea of a proportional baseline. If every LAD were covered at the same overall rate, each local coverage rate would sit close to the national active-user coverage rate. Residuals show locations (LADs) where a place departs from that simple expectation.

bias_residuals <- debiasR::validate_bias_residual_structure(
  coverage_df = coverage,
  coverage_area_col = "area",
  population_col = "population",
  user_count_col = "user_count"
)

residual_lad <- bias_residuals$area_level |>
  mutate(
    baseline_coverage_bias = 1 - global_coverage_score,
    residual_bias = coverage_bias - baseline_coverage_bias
  ) |>
  left_join(
    coverage |> select(area, name),
    by = "area"
  )

residual_lad |>
  select(area, name, coverage_score, coverage_bias, residual_bias) |>
  head()
       area                 name coverage_score coverage_bias residual_bias
1 E06000001           Hartlepool    0.013071542     0.9869285 -0.0020957494
2 E06000002        Middlesbrough    0.007142559     0.9928574  0.0038332327
3 E06000003 Redcar and Cleveland    0.008379049     0.9916210  0.0025967426
4 E06000004     Stockton-on-Tees    0.008926982     0.9910730  0.0020488102
5 E06000005           Darlington    0.010807150     0.9891928  0.0001686417
6 E06000006               Halton    0.013574308     0.9864257 -0.0025985164

Positive values of residual_bias identify LADs with more coverage bias than expected under the proportional baseline. Negative values identify LADs with less coverage bias than expected.

Next, we join these residuals to the area-level covariates.

bias_covariates <- residual_lad |>
  left_join(
    covariates |>
      select(
        area,
        per_age_20.29,
        per_age_70plus,
        per_level4,
        per_hh_no_centralheat,
        per_NS_SeC_L13_routine,
        rural_pct
      ),
    by = "area"
  )

bias_covariates |>
  select(
    area,
    name,
    residual_bias,
    per_age_20.29,
    per_age_70plus,
    per_level4,
    rural_pct
  ) |>
  head()
       area                 name residual_bias per_age_20.29 per_age_70plus
1 E06000001           Hartlepool -0.0020957494      11.48510       14.06370
2 E06000002        Middlesbrough  0.0038332327      14.23857       11.79217
3 E06000003 Redcar and Cleveland  0.0025967426      10.28030       17.02885
4 E06000004     Stockton-on-Tees  0.0020488102      10.99062       13.42404
5 E06000005           Darlington  0.0001686417      10.95073       14.91823
6 E06000006               Halton -0.0025985164      11.41587       13.00290
  per_level4  rural_pct
1   24.80518  3.0949158
2   26.44134  0.2195405
3   24.92656 31.6923280
4   29.52218  5.5527863
5   28.96138 13.0556973
6   23.93120  0.7223308

Explore one relationship at a time

Before fitting a model, it is useful to look at simple bivariate relationships. These plots enable us to see whether residual bias tends to be higher or lower in areas with different local characteristics.

ggplot(bias_covariates, aes(x = per_age_20.29, y = residual_bias)) +
  geom_hline(yintercept = 0, colour = "grey50") +
  geom_point(alpha = 0.7, colour = "#356859") +
  geom_smooth(method = "lm", se = FALSE, colour = "#8B3A3A") +
  labs(
    x = "Population aged 20-29 (%)",
    y = "Residual coverage bias",
    title = "Does residual bias vary with the young-adult population share?"
  ) +
  theme_minimal()

ggplot(bias_covariates, aes(x = per_level4, y = residual_bias)) +
  geom_hline(yintercept = 0, colour = "grey50") +
  geom_point(alpha = 0.7, colour = "#356859") +
  geom_smooth(method = "lm", se = FALSE, colour = "#8B3A3A") +
  labs(
    x = "Population with level 4 qualifications (%)",
    y = "Residual coverage bias",
    title = "Does residual bias vary with the share of highly educated individuals?"
  ) +
  theme_minimal()

It is important to remember that these plots do not prove causality. However, they help us identify whether coverage bias displays any meaningful patterns rather than being randomly distributed across places.

Fit a simple explanatory model

We now fit a multi-variate linear regression. The outcome is residual_bias, which measures how much more or less coverage bias an LAD has compared with the proportional baseline.

The predictors in this regression represent broad demographic, socioeconomic, resource-access and geographic characteristics:

  • per_age_20.29: share of the population aged 20-29
  • per_age_70plus: share of the population aged 70 or over
  • per_level4: share of the population with Level 4 qualifications
  • per_hh_no_centralheat: share of households without central heating
  • per_NS_SeC_L13_routine: share of the population in routine occupations
  • rural_pct: rural population share
bias_model <- lm(
  residual_bias ~
    per_age_20.29 +
    per_age_70plus +
    per_level4 +
    per_hh_no_centralheat +
    per_NS_SeC_L13_routine +
    rural_pct,
  data = bias_covariates
)

summary(bias_model)

Call:
lm(formula = residual_bias ~ per_age_20.29 + per_age_70plus +
    per_level4 + per_hh_no_centralheat + per_NS_SeC_L13_routine +
    rural_pct, data = bias_covariates)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0145064 -0.0013750  0.0002266  0.0014997  0.0079651

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)            -2.133e-02  2.558e-03  -8.339 2.17e-15 ***
per_age_20.29           2.300e-04  6.737e-05   3.415  0.00072 ***
per_age_70plus          2.155e-04  6.687e-05   3.222  0.00140 **
per_level4              2.347e-04  3.470e-05   6.764 6.30e-11 ***
per_hh_no_centralheat   6.681e-04  1.332e-04   5.015 8.77e-07 ***
per_NS_SeC_L13_routine  5.218e-04  7.486e-05   6.970 1.79e-11 ***
rural_pct              -7.687e-06  8.213e-06  -0.936  0.34999
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.002441 on 324 degrees of freedom
Multiple R-squared:  0.3095,    Adjusted R-squared:  0.2968
F-statistic: 24.21 on 6 and 324 DF,  p-value: < 2.2e-16

For easier reading, we can extract the coefficient table.

coefficient_table <- as.data.frame(coef(summary(bias_model)))
coefficient_table$term <- rownames(coefficient_table)
rownames(coefficient_table) <- NULL
names(coefficient_table) <- c("estimate", "std_error", "statistic", "p_value", "term")

coefficient_table |>
  select(term, estimate, std_error, statistic, p_value)
                    term      estimate    std_error  statistic      p_value
1            (Intercept) -2.133358e-02 2.558308e-03 -8.3389386 2.173616e-15
2          per_age_20.29  2.300286e-04 6.736613e-05  3.4146024 7.199558e-04
3         per_age_70plus  2.154692e-04 6.687211e-05  3.2221083 1.401820e-03
4             per_level4  2.347167e-04 3.470325e-05  6.7635367 6.297559e-11
5  per_hh_no_centralheat  6.680707e-04 1.332254e-04  5.0145895 8.769749e-07
6 per_NS_SeC_L13_routine  5.217525e-04 7.486137e-05  6.9695826 1.786132e-11
7              rural_pct -7.687229e-06 8.213161e-06 -0.9359648 3.499884e-01

The sign of each coefficient (estimate) gives the direction of the association, holding the other variables in the model constant. A positive coefficient means that higher values of the covariate are associated with more residual coverage bias (less representation than average in the mobile phone dataset). A negative coefficient means that higher values are associated with less residual coverage bias.

Compare observed and fitted residuals

The fitted values from the model summarise the part of residual bias that can be described by the selected covariates.

bias_covariates <- bias_covariates |>
  mutate(
    fitted_residual_bias = fitted(bias_model),
    unexplained_residual = residual_bias - fitted_residual_bias
  )

ggplot(bias_covariates, aes(x = fitted_residual_bias, y = residual_bias)) +
  geom_hline(yintercept = 0, colour = "grey80") +
  geom_vline(xintercept = 0, colour = "grey80") +
  geom_point(alpha = 0.7, colour = "#356859") +
  geom_smooth(method = "lm", se = FALSE, colour = "#8B3A3A") +
  labs(
    x = "Fitted residual bias",
    y = "Observed residual bias",
    title = "How much residual bias is described by the selected covariates?"
  ) +
  theme_minimal()

The largest unexplained residuals are areas where the simple model does not capture the local pattern well. These areas may be useful to inspect before choosing an adjustment strategy.

bias_covariates |>
  arrange(desc(abs(unexplained_residual))) |>
  select(
    area,
    name,
    residual_bias,
    fitted_residual_bias,
    unexplained_residual
  ) |>
  head(10)
        area              name residual_bias fitted_residual_bias
1  E09000001    City of London  -0.006384105         8.122286e-03
2  E07000228        Mid Sussex   0.005293006        -2.672141e-03
3  E07000069      Castle Point  -0.011125629        -3.874755e-03
4  E07000098         Hertsmere  -0.009093393        -2.256289e-03
5  E07000113             Swale  -0.008249398        -1.571936e-03
6  E07000030              Eden  -0.005470657         1.957089e-04
7  E08000032          Bradford   0.004585010        -9.408765e-04
8  E07000167           Ryedale  -0.005474680        -4.322945e-06
9  E07000135 Oadby and Wigston   0.004689751        -7.009294e-04
10 E06000034          Thurrock  -0.006666153        -1.387337e-03
   unexplained_residual
1          -0.014506392
2           0.007965147
3          -0.007250874
4          -0.006837104
5          -0.006677462
6          -0.005666366
7           0.005525887
8          -0.005470357
9           0.005390681
10         -0.005278816

Interpretation

This vignette uses a simple regression to “explain” bias. If residual bias shows a non-trivial relationship with area-level characteristics, then the variation in bias across geographies is not simply random noise. Instead, this is a sign that some types of places are more or less visible in the digital trace data than expected under a constant-coverage baseline.

This will be important for bias adjustment. A single global correction assumes that all places are covered at roughly the same rate. If coverage bias varies with demographic, socioeconomic or geographic characteristics, then adjustment methods need to allow bias to vary across places too.

The regression used here is deliberately simple and descriptive. The associated research paper uses more flexible explainable machine-learning methods to capture nonlinear and interacting relationships. The purpose of this vignette is to introduce the same logic in a way that is easy to inspect, reproduce and adapt.