Validate Bias Residual Structure — validate_bias_residual_structure

Builds Stage 3 diagnostics for active-user coverage residuals. The helper starts from the same coverage quantities as measure_bias(), computes a global coverage score, and returns area-level residuals plus optional spatial, benchmark-flow, covariate, and plotting diagnostics.

The default residual is:

$$\text{coverage\_score\_residual}_i = \frac{\text{user\_count}_i}{\text{population}_i} - \frac{\sum_j \text{user\_count}_j}{\sum_j \text{population}_j}$$

where i indexes the area and the sums run over all areas j.

Positive values mean an area has higher active-user coverage than expected under a constant global coverage rate. Negative values mean lower coverage than expected under that global rate.
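
As a worked check of this definition, the sketch below recomputes the default residual in base R for three areas, using the population and user counts printed in the example output on this page and the global totals from its summary table. The object names are illustrative only and are not part of the package.

```r
# Three areas copied from the example output; object names are illustrative
areas <- data.frame(
  area       = c("Adur", "Allerdale", "Amber Valley"),
  population = c(4975, 6750, 9334),
  user_count = c(558, 1005, 808)
)

# Global coverage from the totals in the example summary (all 328 areas)
global_coverage <- 626483 / 5796823

# Area coverage score minus the global coverage rate
areas$coverage_score <- areas$user_count / areas$population
areas$coverage_score_residual <- areas$coverage_score - global_coverage

print(signif(areas$coverage_score_residual, 3))
```

To three significant figures the residuals are 0.00409, 0.0408, and -0.0215, which agree with the selected_residual values listed for these areas in the example output: Adur and Allerdale sit above the global rate, Amber Valley below it.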

Usage

validate_bias_residual_structure(
  coverage_df,
  coverage_area_col = "origin",
  population_col = "population",
  user_count_col = "user_count",
  residual_type = c("coverage_score", "user_count", "standardized_user_count",
    "population_lm"),
  benchmark_od_df = NULL,
  origin_col = "origin",
  destination_col = "destination",
  flow_col_bench = "flow",
  benchmark_flow_roles = c("origin", "destination"),
  area_neighbors = NULL,
  area_col = "area",
  neighbor_col = "neighbor",
  weight_col = NULL,
  covariate_df = NULL,
  covariate_col = NULL,
  covariate_area_col = "area",
  geometry_df = NULL,
  geometry_area_col = "area",
  x_col = NULL,
  y_col = NULL,
  make_plots = FALSE
)

Arguments

coverage_df: A data frame with one row per area and columns containing an area identifier, benchmark population, and active-user count.
coverage_area_col: Column in coverage_df identifying the area. Default "origin".
population_col: Population column in coverage_df. Default "population".
user_count_col: Active-user count column in coverage_df. Default "user_count".
residual_type: Residual series to diagnose. "coverage_score" (the default) uses the area coverage score minus the global coverage score; "user_count" uses observed minus expected user counts; "standardized_user_count" uses the user-count residual divided by sqrt(expected_user_count); "population_lm" uses residuals from a descriptive user_count ~ population linear model.
benchmark_od_df: Optional benchmark OD data frame. When supplied, benchmark flows are collapsed to area-level origin and/or destination totals and correlated with the selected bias residual.
origin_col: Origin column in benchmark_od_df. Default "origin".
destination_col: Destination column in benchmark_od_df. Default "destination".
flow_col_bench: Benchmark flow column in benchmark_od_df. Default "flow".
benchmark_flow_roles: Which benchmark area totals to compute: "origin", "destination", or "both". The default c("origin", "destination") returns both separately.
area_neighbors: Optional neighbour table for Moran's I.
area_col: Column in area_neighbors identifying the focal area. Default "area".
neighbor_col: Column in area_neighbors identifying the neighbouring area. Default "neighbor".
weight_col: Optional positive numeric weight column in area_neighbors. If NULL, all neighbour links receive weight 1.
covariate_df: Optional area-level covariate table.
covariate_col: Optional covariate column to correlate with area-level bias residuals. Requires covariate_df.
covariate_area_col: Area key in covariate_df. Default "area".
geometry_df: Optional area table with coordinates or geometry-like columns to join onto map_data.
geometry_area_col: Area key in geometry_df. Default "area".
x_col: Optional x-coordinate column in map_data, used only when make_plots = TRUE.
y_col: Optional y-coordinate column in map_data, used only when make_plots = TRUE.
make_plots: Logical. If TRUE, return ggplot objects for the selected residual distribution, optional residual-versus-benchmark-flow scatter, optional residual-versus-covariate scatter, and optional coordinate residual map. Requires ggplot2.
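
The four residual_type options can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: it assumes that expected_user_count is the global coverage score multiplied by the area population, which is consistent with the "constant global coverage rate" wording above but is not spelled out on this page. The toy data are invented.

```r
# Toy data (invented for illustration)
toy <- data.frame(
  area       = c("A", "B", "C", "D"),
  population = c(1000, 2000, 4000, 3000),
  user_count = c(120, 180, 500, 200)
)

# Global coverage score: total users over total population
global_coverage <- sum(toy$user_count) / sum(toy$population)

# Assumption: expected users under a constant global coverage rate
expected_user_count <- global_coverage * toy$population

# "coverage_score": area coverage minus global coverage
toy$coverage_score_residual <- toy$user_count / toy$population - global_coverage

# "user_count": observed minus expected user counts
toy$user_count_residual <- toy$user_count - expected_user_count

# "standardized_user_count": count residual scaled by sqrt(expected count)
toy$standardized_user_count_residual <-
  toy$user_count_residual / sqrt(expected_user_count)

# "population_lm": residuals from a descriptive linear model
fit <- lm(user_count ~ population, data = toy)
toy$population_lm_residual <- unname(residuals(fit))

print(toy)
```

Under this assumption the user-count residuals sum to zero by construction, and each one equals the coverage-score residual multiplied by the area population. The population_lm residuals also sum to zero, but they are measured against a fitted line with an intercept rather than against a single global rate.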

Value

A list with:

summary: one-row tibble with global coverage, residual spread, Moran's I, and benchmark-flow/covariate correlations when available,
residual_definitions: definitions and sign interpretations,
moran_i: Moran's I summary from the neighbour table; when no neighbour table is supplied, the summary is still returned with the link count, weight sum, and Moran's I set to NA,
benchmark_flow_correlation: Pearson correlations between the selected bias residual and benchmark origin/destination flow totals when benchmark OD data are supplied,
covariate_correlation: optional Pearson correlation between selected bias residuals and the selected covariate,
population_lm: one-row summary of the descriptive user_count ~ population linear model (n, intercept, population coefficient, and R-squared),
area_level: area-level residual table,
map_data: area-level residual table joined to geometry_df when supplied,
benchmark_flow_data, covariate_data, and plots: optional review-ready outputs when requested.
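
For orientation, global Moran's I on a long-format neighbour table can be sketched as below. The sketch uses the textbook form I = (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2), where z are mean-centred residuals and S0 is the sum of the link weights. Whether the package row-standardises weights or follows other conventions is not stated on this page, so treat that as an assumption. The helper name moran_i_sketch and the toy data are hypothetical.

```r
# Hypothetical helper: global Moran's I from a long neighbour table
moran_i_sketch <- function(residuals, neighbors, weight_col = NULL) {
  # residuals: named numeric vector, one value per area
  # neighbors: data frame with columns `area` and `neighbor`
  # weight_col: optional weight column name; NULL gives every link weight 1
  w <- if (is.null(weight_col)) rep(1, nrow(neighbors)) else neighbors[[weight_col]]
  z <- residuals - mean(residuals)
  from <- as.character(neighbors$area)
  to   <- as.character(neighbors$neighbor)
  # Weighted cross-products of centred residuals over each neighbour link
  cross <- sum(w * z[from] * z[to])
  (length(residuals) / sum(w)) * cross / sum(z^2)
}

# Four areas on a line (A - B - C - D), each link listed in both directions
nb <- data.frame(
  area     = c("A", "B", "B", "C", "C", "D"),
  neighbor = c("B", "A", "C", "B", "D", "C")
)
res <- c(A = 0.03, B = 0.01, C = -0.01, D = -0.03)

moran_i_sketch(res, nb)
```

Because the toy residuals decline steadily along the line, neighbouring areas carry similar values and the statistic is positive (1/3 for these inputs). Values near zero would suggest little spatial structure in the residuals.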

Examples

data(simulated_coverage)
data(simulated_benchmark.od)
data(simulated_covariates)

validate_bias_residual_structure(
  coverage_df = simulated_coverage,
  benchmark_od_df = simulated_benchmark.od,
  covariate_df = simulated_covariates,
  covariate_col = "internet_access"
)
#> $summary
#> # A tibble: 1 × 14
#>   residual_type  selected_residual_col n_areas total_population total_user_count
#>   <chr>          <chr>                   <int>            <dbl>            <dbl>
#> 1 coverage_score coverage_score_resid…     328          5796823           626483
#> # ℹ 9 more variables: global_coverage_score <dbl>, mean_coverage_score <dbl>,
#> #   sd_coverage_score <dbl>, mean_selected_residual <dbl>,
#> #   sd_selected_residual <dbl>, moran_i <dbl>,
#> #   pearson_bias_benchmark_origin_flow <dbl>,
#> #   pearson_bias_benchmark_destination_flow <dbl>, pearson_bias_covariate <dbl>
#> 
#> $residual_definitions
#> # A tibble: 4 × 3
#>   residual                         definition                     interpretation
#>   <chr>                            <chr>                          <chr>         
#> 1 coverage_score_residual          coverage_score - global_cover… Positive valu…
#> 2 user_count_residual              user_count - expected_user_co… Positive valu…
#> 3 standardized_user_count_residual user_count_residual / sqrt(ex… Positive valu…
#> 4 population_lm_residual           user_count - fitted(user_coun… Positive valu…
#> 
#> $moran_i
#> # A tibble: 1 × 5
#>   residual_type  n_areas_used n_links_used weight_sum moran_i
#>   <chr>                 <int>        <int>      <dbl>   <dbl>
#> 1 coverage_score          328           NA         NA      NA
#> 
#> $benchmark_flow_correlation
#> # A tibble: 2 × 4
#>   residual_type  benchmark_flow_role     n pearson_r
#>   <chr>          <chr>               <int>     <dbl>
#> 1 coverage_score destination           328    0.0860
#> 2 coverage_score origin                328    0.0849
#> 
#> $covariate_correlation
#> # A tibble: 1 × 4
#>   residual_type  covariate           n pearson_r
#>   <chr>          <chr>           <int>     <dbl>
#> 1 coverage_score internet_access   328    0.0837
#> 
#> $population_lm
#> # A tibble: 1 × 6
#>   residual_type model               n intercept population_coefficient r_squared
#>   <chr>         <chr>           <int>     <dbl>                  <dbl>     <dbl>
#> 1 population_lm user_count ~ p…   328     -296.                  0.125     0.749
#> 
#> $area_level
#> # A tibble: 328 × 17
#>    area                 population user_count coverage_score coverage_bias  bias
#>    <chr>                     <dbl>      <dbl>          <dbl>         <dbl> <dbl>
#>  1 Adur                       4975        558         0.112          0.888 0.888
#>  2 Allerdale                  6750       1005         0.149          0.851 0.851
#>  3 Amber Valley               9334        808         0.0866         0.913 0.913
#>  4 Arun                      13750       1416         0.103          0.897 0.897
#>  5 Ashfield                   9323       1280         0.137          0.863 0.863
#>  6 Ashford                   12042       1287         0.107          0.893 0.893
#>  7 Babergh                    7491        449         0.0599         0.940 0.940
#>  8 Barking and Dagenham      20930       1279         0.0611         0.939 0.939
#>  9 Barnet                    45399       2723         0.0600         0.940 0.940
#> 10 Barnsley                  17559       1349         0.0768         0.923 0.923
#> # ℹ 318 more rows
#> # ℹ 11 more variables: global_coverage_score <dbl>, expected_user_count <dbl>,
#> #   user_count_residual <dbl>, coverage_score_residual <dbl>,
#> #   standardized_user_count_residual <dbl>,
#> #   population_lm_expected_user_count <dbl>, population_lm_residual <dbl>,
#> #   residual_type <chr>, selected_residual <dbl>,
#> #   benchmark_origin_flow_total <dbl>, benchmark_destination_flow_total <dbl>
#> 
#> $map_data
#> # A tibble: 328 × 17
#>    area                 population user_count coverage_score coverage_bias  bias
#>    <chr>                     <dbl>      <dbl>          <dbl>         <dbl> <dbl>
#>  1 Adur                       4975        558         0.112          0.888 0.888
#>  2 Allerdale                  6750       1005         0.149          0.851 0.851
#>  3 Amber Valley               9334        808         0.0866         0.913 0.913
#>  4 Arun                      13750       1416         0.103          0.897 0.897
#>  5 Ashfield                   9323       1280         0.137          0.863 0.863
#>  6 Ashford                   12042       1287         0.107          0.893 0.893
#>  7 Babergh                    7491        449         0.0599         0.940 0.940
#>  8 Barking and Dagenham      20930       1279         0.0611         0.939 0.939
#>  9 Barnet                    45399       2723         0.0600         0.940 0.940
#> 10 Barnsley                  17559       1349         0.0768         0.923 0.923
#> # ℹ 318 more rows
#> # ℹ 11 more variables: global_coverage_score <dbl>, expected_user_count <dbl>,
#> #   user_count_residual <dbl>, coverage_score_residual <dbl>,
#> #   standardized_user_count_residual <dbl>,
#> #   population_lm_expected_user_count <dbl>, population_lm_residual <dbl>,
#> #   residual_type <chr>, selected_residual <dbl>,
#> #   benchmark_origin_flow_total <dbl>, benchmark_destination_flow_total <dbl>
#> 
#> $benchmark_flow_data
#> # A tibble: 656 × 9
#>    area                 residual_type  benchmark_flow_role selected_residual
#>    <chr>                <chr>          <chr>                           <dbl>
#>  1 Adur                 coverage_score origin                        0.00409
#>  2 Allerdale            coverage_score origin                        0.0408 
#>  3 Amber Valley         coverage_score origin                       -0.0215 
#>  4 Arun                 coverage_score origin                       -0.00509
#>  5 Ashfield             coverage_score origin                        0.0292 
#>  6 Ashford              coverage_score origin                       -0.00120
#>  7 Babergh              coverage_score origin                       -0.0481 
#>  8 Barking and Dagenham coverage_score origin                       -0.0470 
#>  9 Barnet               coverage_score origin                       -0.0481 
#> 10 Barnsley             coverage_score origin                       -0.0312 
#> # ℹ 646 more rows
#> # ℹ 5 more variables: benchmark_flow_total <dbl>, population <dbl>,
#> #   user_count <dbl>, coverage_score <dbl>, coverage_score_residual <dbl>
#> 
#> $covariate_data
#> # A tibble: 328 × 18
#>    area       covariate_value population user_count coverage_score coverage_bias
#>    <chr>                <dbl>      <dbl>      <dbl>          <dbl>         <dbl>
#>  1 Adur                 0.254       4975        558         0.112          0.888
#>  2 Allerdale            0.181       6750       1005         0.149          0.851
#>  3 Amber Val…           0.317       9334        808         0.0866         0.913
#>  4 Arun                 0.377      13750       1416         0.103          0.897
#>  5 Ashfield             0.310       9323       1280         0.137          0.863
#>  6 Ashford              0.239      12042       1287         0.107          0.893
#>  7 Babergh              0.306       7491        449         0.0599         0.940
#>  8 Barking a…           0.377      20930       1279         0.0611         0.939
#>  9 Barnet               0.567      45399       2723         0.0600         0.940
#> 10 Barnsley             0.352      17559       1349         0.0768         0.923
#> # ℹ 318 more rows
#> # ℹ 12 more variables: bias <dbl>, global_coverage_score <dbl>,
#> #   expected_user_count <dbl>, user_count_residual <dbl>,
#> #   coverage_score_residual <dbl>, standardized_user_count_residual <dbl>,
#> #   population_lm_expected_user_count <dbl>, population_lm_residual <dbl>,
#> #   residual_type <chr>, selected_residual <dbl>,
#> #   benchmark_origin_flow_total <dbl>, benchmark_destination_flow_total <dbl>
#>