Builds Stage 3 diagnostics for active-user coverage residuals. The helper
starts from the same coverage quantities as measure_bias(), computes a
global coverage score, and returns area-level residuals plus optional
spatial, benchmark-flow, covariate, and plotting diagnostics.
The default residual is: $$coverage\_score\_residual_i = \frac{user\_count_i}{population_i} - \frac{\sum_j user\_count_j}{\sum_j population_j}$$
Positive values mean an area has higher active-user coverage than expected under a constant global coverage rate. Negative values mean lower coverage than expected under that global rate.
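As a quick check of the formula, the sketch below recomputes the default residual for three areas using only figures printed in the Examples output on this page (the totals from $summary and the first rows of $area_level). It is a base R illustration, not the function's internal code, and the object names are chosen here for readability.

```r
# Totals as printed in the $summary table of the Examples output.
total_population <- 5796823
total_user_count <- 626483
global_coverage_score <- total_user_count / total_population

# First three rows of the $area_level table in the Examples output.
areas <- data.frame(
  area       = c("Adur", "Allerdale", "Amber Valley"),
  population = c(4975, 6750, 9334),
  user_count = c(558, 1005, 808),
  stringsAsFactors = FALSE
)

# Area coverage score minus the global coverage score.
areas$coverage_score <- areas$user_count / areas$population
areas$coverage_score_residual <- areas$coverage_score - global_coverage_score

# Three significant digits, matching the precision of the tibble printout.
signif(areas$coverage_score_residual, 3)
```

The rounded values agree with the selected_residual column printed for these areas in $benchmark_flow_data (0.00409, 0.0408, and -0.0215).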
Usage
validate_bias_residual_structure(
  coverage_df,
  coverage_area_col = "origin",
  population_col = "population",
  user_count_col = "user_count",
  residual_type = c("coverage_score", "user_count", "standardized_user_count",
    "population_lm"),
  benchmark_od_df = NULL,
  origin_col = "origin",
  destination_col = "destination",
  flow_col_bench = "flow",
  benchmark_flow_roles = c("origin", "destination"),
  area_neighbors = NULL,
  area_col = "area",
  neighbor_col = "neighbor",
  weight_col = NULL,
  covariate_df = NULL,
  covariate_col = NULL,
  covariate_area_col = "area",
  geometry_df = NULL,
  geometry_area_col = "area",
  x_col = NULL,
  y_col = NULL,
  make_plots = FALSE
)
Arguments
- coverage_df
A data frame with one row per area and columns containing an area identifier, benchmark population, and active-user count.
- coverage_area_col
Column in coverage_df identifying the area. Default "origin".
- population_col
Population column in coverage_df. Default "population".
- user_count_col
Active-user count column in coverage_df. Default "user_count".
- residual_type
Residual series to diagnose: "coverage_score" uses the coverage score minus the global coverage score, "user_count" uses observed minus expected user counts, "standardized_user_count" uses the user-count residual divided by sqrt(expected_user_count), and "population_lm" uses residuals from a descriptive user_count ~ population linear model.
- benchmark_od_df
Optional benchmark OD data frame. When supplied, benchmark flows are collapsed to area-level origin and/or destination totals and correlated with the selected bias residual.
- origin_col
Origin column in benchmark_od_df. Default "origin".
- destination_col
Destination column in benchmark_od_df. Default "destination".
- flow_col_bench
Benchmark flow column in benchmark_od_df. Default "flow".
- benchmark_flow_roles
Which benchmark area totals to compute: "origin", "destination", or "both". The default c("origin", "destination") returns both separately.
- area_neighbors
Optional neighbour table for Moran's I.
- area_col
Column in area_neighbors identifying the focal area. Default "area".
- neighbor_col
Column in area_neighbors identifying the neighbouring area. Default "neighbor".
- weight_col
Optional positive numeric weight column in area_neighbors. If NULL, all neighbour links receive weight 1.
- covariate_df
Optional area-level covariate table.
- covariate_col
Optional covariate column to correlate with area-level bias residuals. Requires covariate_df.
- covariate_area_col
Area key in covariate_df. Default "area".
- geometry_df
Optional area table with coordinates or geometry-like columns to join onto map_data.
- geometry_area_col
Area key in geometry_df. Default "area".
- x_col
Optional x-coordinate column in map_data, used only when make_plots = TRUE.
- y_col
Optional y-coordinate column in map_data, used only when make_plots = TRUE.
- make_plots
Logical. If TRUE, return ggplot objects for the selected residual distribution, optional residual-versus-benchmark-flow scatter, optional residual-versus-covariate scatter, and optional coordinate residual map. Requires ggplot2.
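The four residual_type options can be compared side by side with a few lines of base R. The sketch below uses made-up data and is not the package's implementation. In particular, it assumes that the expected user count is the area population multiplied by the global coverage score, which matches the constant-global-rate description but is not confirmed by this page.

```r
# Toy data (hypothetical); column names follow the function defaults.
coverage_df <- data.frame(
  origin     = c("A", "B", "C", "D"),
  population = c(1000, 2000, 4000, 3000),
  user_count = c(150, 180, 420, 250),
  stringsAsFactors = FALSE
)

global_coverage_score <- sum(coverage_df$user_count) / sum(coverage_df$population)
coverage_score <- coverage_df$user_count / coverage_df$population

# Assumption: expected count under a constant global coverage rate.
expected_user_count <- coverage_df$population * global_coverage_score
user_count_residual <- coverage_df$user_count - expected_user_count

# Descriptive linear model of user counts on population.
population_fit <- lm(user_count ~ population, data = coverage_df)

residuals_by_type <- data.frame(
  origin                  = coverage_df$origin,
  coverage_score          = coverage_score - global_coverage_score,
  user_count              = user_count_residual,
  standardized_user_count = user_count_residual / sqrt(expected_user_count),
  population_lm           = unname(residuals(population_fit))
)

residuals_by_type
```

Under this assumption the "user_count" residuals sum to zero by construction, and the "population_lm" residuals also sum to zero because the model includes an intercept. The standardized version rescales the count residual so that large and small areas are more comparable.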
Value
A list with:
- summary: one-row tibble with global coverage, residual spread, Moran's I, and benchmark-flow/covariate correlations when available,
- residual_definitions: definitions and sign interpretations,
- moran_i: Moran's I summary from the neighbour table, or NA when no neighbour table is supplied,
- benchmark_flow_correlation: Pearson correlations between the selected bias residual and benchmark origin/destination flow totals when benchmark OD data are supplied,
- covariate_correlation: optional Pearson correlation between selected bias residuals and the selected covariate,
- area_level: area-level residual table,
- map_data: area-level residual table joined to geometry_df when supplied,
- benchmark_flow_data, covariate_data, and plots: optional review-ready outputs when requested.
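To make the moran_i element more concrete, the sketch below computes Moran's I from a long-format neighbour table shaped like area_neighbors. It uses the common textbook form I = (n / W) * sum(w_ij * z_i * z_j) / sum(z_i^2), where z is the mean-centred residual and W is the sum of link weights. The data are invented, and the function may normalise weights or handle missing areas differently.

```r
# Hypothetical area-level residuals, named by area identifier.
residual <- c(A = 0.05, B = -0.01, C = 0.005, D = -0.017)

# Hypothetical neighbour table: a chain A - B - C - D, one row per directed link.
area_neighbors <- data.frame(
  area     = c("A", "B", "B", "C", "C", "D"),
  neighbor = c("B", "A", "C", "B", "D", "C"),
  stringsAsFactors = FALSE
)

# With weight_col = NULL every neighbour link receives weight 1.
w <- rep(1, nrow(area_neighbors))

# Mean-centre the residuals and look up both ends of every link by name.
z <- residual - mean(residual)
n <- length(residual)
cross_product <- sum(w * z[area_neighbors$area] * z[area_neighbors$neighbor])

moran_i <- (n / sum(w)) * cross_product / sum(z^2)
moran_i
```

Values above zero suggest that neighbouring areas tend to have similar residuals, while values below zero suggest that neighbours tend to differ.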
Examples
data(simulated_coverage)
data(simulated_benchmark.od)
data(simulated_covariates)
validate_bias_residual_structure(
  coverage_df = simulated_coverage,
  benchmark_od_df = simulated_benchmark.od,
  covariate_df = simulated_covariates,
  covariate_col = "internet_access"
)
#> $summary
#> # A tibble: 1 × 14
#> residual_type selected_residual_col n_areas total_population total_user_count
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 coverage_score coverage_score_resid… 328 5796823 626483
#> # ℹ 9 more variables: global_coverage_score <dbl>, mean_coverage_score <dbl>,
#> # sd_coverage_score <dbl>, mean_selected_residual <dbl>,
#> # sd_selected_residual <dbl>, moran_i <dbl>,
#> # pearson_bias_benchmark_origin_flow <dbl>,
#> # pearson_bias_benchmark_destination_flow <dbl>, pearson_bias_covariate <dbl>
#>
#> $residual_definitions
#> # A tibble: 4 × 3
#> residual definition interpretation
#> <chr> <chr> <chr>
#> 1 coverage_score_residual coverage_score - global_cover… Positive valu…
#> 2 user_count_residual user_count - expected_user_co… Positive valu…
#> 3 standardized_user_count_residual user_count_residual / sqrt(ex… Positive valu…
#> 4 population_lm_residual user_count - fitted(user_coun… Positive valu…
#>
#> $moran_i
#> # A tibble: 1 × 5
#> residual_type n_areas_used n_links_used weight_sum moran_i
#> <chr> <int> <int> <dbl> <dbl>
#> 1 coverage_score 328 NA NA NA
#>
#> $benchmark_flow_correlation
#> # A tibble: 2 × 4
#> residual_type benchmark_flow_role n pearson_r
#> <chr> <chr> <int> <dbl>
#> 1 coverage_score destination 328 0.0860
#> 2 coverage_score origin 328 0.0849
#>
#> $covariate_correlation
#> # A tibble: 1 × 4
#> residual_type covariate n pearson_r
#> <chr> <chr> <int> <dbl>
#> 1 coverage_score internet_access 328 0.0837
#>
#> $population_lm
#> # A tibble: 1 × 6
#> residual_type model n intercept population_coefficient r_squared
#> <chr> <chr> <int> <dbl> <dbl> <dbl>
#> 1 population_lm user_count ~ p… 328 -296. 0.125 0.749
#>
#> $area_level
#> # A tibble: 328 × 17
#> area population user_count coverage_score coverage_bias bias
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Adur 4975 558 0.112 0.888 0.888
#> 2 Allerdale 6750 1005 0.149 0.851 0.851
#> 3 Amber Valley 9334 808 0.0866 0.913 0.913
#> 4 Arun 13750 1416 0.103 0.897 0.897
#> 5 Ashfield 9323 1280 0.137 0.863 0.863
#> 6 Ashford 12042 1287 0.107 0.893 0.893
#> 7 Babergh 7491 449 0.0599 0.940 0.940
#> 8 Barking and Dagenham 20930 1279 0.0611 0.939 0.939
#> 9 Barnet 45399 2723 0.0600 0.940 0.940
#> 10 Barnsley 17559 1349 0.0768 0.923 0.923
#> # ℹ 318 more rows
#> # ℹ 11 more variables: global_coverage_score <dbl>, expected_user_count <dbl>,
#> # user_count_residual <dbl>, coverage_score_residual <dbl>,
#> # standardized_user_count_residual <dbl>,
#> # population_lm_expected_user_count <dbl>, population_lm_residual <dbl>,
#> # residual_type <chr>, selected_residual <dbl>,
#> # benchmark_origin_flow_total <dbl>, benchmark_destination_flow_total <dbl>
#>
#> $map_data
#> # A tibble: 328 × 17
#> area population user_count coverage_score coverage_bias bias
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Adur 4975 558 0.112 0.888 0.888
#> 2 Allerdale 6750 1005 0.149 0.851 0.851
#> 3 Amber Valley 9334 808 0.0866 0.913 0.913
#> 4 Arun 13750 1416 0.103 0.897 0.897
#> 5 Ashfield 9323 1280 0.137 0.863 0.863
#> 6 Ashford 12042 1287 0.107 0.893 0.893
#> 7 Babergh 7491 449 0.0599 0.940 0.940
#> 8 Barking and Dagenham 20930 1279 0.0611 0.939 0.939
#> 9 Barnet 45399 2723 0.0600 0.940 0.940
#> 10 Barnsley 17559 1349 0.0768 0.923 0.923
#> # ℹ 318 more rows
#> # ℹ 11 more variables: global_coverage_score <dbl>, expected_user_count <dbl>,
#> # user_count_residual <dbl>, coverage_score_residual <dbl>,
#> # standardized_user_count_residual <dbl>,
#> # population_lm_expected_user_count <dbl>, population_lm_residual <dbl>,
#> # residual_type <chr>, selected_residual <dbl>,
#> # benchmark_origin_flow_total <dbl>, benchmark_destination_flow_total <dbl>
#>
#> $benchmark_flow_data
#> # A tibble: 656 × 9
#> area residual_type benchmark_flow_role selected_residual
#> <chr> <chr> <chr> <dbl>
#> 1 Adur coverage_score origin 0.00409
#> 2 Allerdale coverage_score origin 0.0408
#> 3 Amber Valley coverage_score origin -0.0215
#> 4 Arun coverage_score origin -0.00509
#> 5 Ashfield coverage_score origin 0.0292
#> 6 Ashford coverage_score origin -0.00120
#> 7 Babergh coverage_score origin -0.0481
#> 8 Barking and Dagenham coverage_score origin -0.0470
#> 9 Barnet coverage_score origin -0.0481
#> 10 Barnsley coverage_score origin -0.0312
#> # ℹ 646 more rows
#> # ℹ 5 more variables: benchmark_flow_total <dbl>, population <dbl>,
#> # user_count <dbl>, coverage_score <dbl>, coverage_score_residual <dbl>
#>
#> $covariate_data
#> # A tibble: 328 × 18
#> area covariate_value population user_count coverage_score coverage_bias
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Adur 0.254 4975 558 0.112 0.888
#> 2 Allerdale 0.181 6750 1005 0.149 0.851
#> 3 Amber Val… 0.317 9334 808 0.0866 0.913
#> 4 Arun 0.377 13750 1416 0.103 0.897
#> 5 Ashfield 0.310 9323 1280 0.137 0.863
#> 6 Ashford 0.239 12042 1287 0.107 0.893
#> 7 Babergh 0.306 7491 449 0.0599 0.940
#> 8 Barking a… 0.377 20930 1279 0.0611 0.939
#> 9 Barnet 0.567 45399 2723 0.0600 0.940
#> 10 Barnsley 0.352 17559 1349 0.0768 0.923
#> # ℹ 318 more rows
#> # ℹ 12 more variables: bias <dbl>, global_coverage_score <dbl>,
#> # expected_user_count <dbl>, user_count_residual <dbl>,
#> # coverage_score_residual <dbl>, standardized_user_count_residual <dbl>,
#> # population_lm_expected_user_count <dbl>, population_lm_residual <dbl>,
#> # residual_type <chr>, selected_residual <dbl>,
#> # benchmark_origin_flow_total <dbl>, benchmark_destination_flow_total <dbl>
#>
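The benchmark-flow diagnostic in the output above collapses OD flows to area totals and correlates them with the selected residual. A minimal base R sketch of that idea follows. The data are invented, the column names follow the function defaults and the printed output, and the join logic is an assumption rather than the package's own code.

```r
# Hypothetical benchmark OD table using the default column names.
benchmark_od_df <- data.frame(
  origin      = c("A", "A", "B", "C"),
  destination = c("B", "C", "C", "A"),
  flow        = c(10, 5, 20, 8),
  stringsAsFactors = FALSE
)

# Hypothetical area-level residuals.
area_level <- data.frame(
  area              = c("A", "B", "C"),
  selected_residual = c(0.05, -0.01, 0.005),
  stringsAsFactors = FALSE
)

# Collapse flows to one total per origin and one total per destination.
origin_totals <- aggregate(flow ~ origin, data = benchmark_od_df, FUN = sum)
names(origin_totals) <- c("area", "benchmark_origin_flow_total")
destination_totals <- aggregate(flow ~ destination, data = benchmark_od_df, FUN = sum)
names(destination_totals) <- c("area", "benchmark_destination_flow_total")

# Join the totals onto the residual table by area.
joined <- merge(area_level, origin_totals, by = "area")
joined <- merge(joined, destination_totals, by = "area")

# Pearson correlation of the residual with each flow total.
pearson_origin <- cor(joined$selected_residual, joined$benchmark_origin_flow_total)
pearson_destination <- cor(joined$selected_residual, joined$benchmark_destination_flow_total)
c(origin = pearson_origin, destination = pearson_destination)
```

A correlation near zero for both roles, as in the simulated example, indicates that coverage residuals are only weakly related to how much benchmark flow an area sends or receives.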