Measure active-user versus population distribution bias
Source: R/measure_bias.R
Compares the spatial distribution of active users with the benchmark
population distribution. This complements measure_bias(), which
reports area-level coverage rates. Here the target is distributional
imbalance: whether active users are allocated across areas in the same
proportions as the benchmark population.
The directional metric is $$\mathrm{KL}(\text{population} \,\|\, \text{active users})$$ with Jensen-Shannon divergence (JSD) returned as a symmetric companion metric. For both metrics, lower values mean the active-user distribution is closer to the population distribution.
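The two metrics can be sketched by hand in base R. This is a minimal illustration, not the package's implementation: the toy vectors are hypothetical, and the natural-log base and the exact way `epsilon` enters the shares are assumptions based on the argument description (smoothing added before shares are computed).

```r
# Hypothetical toy data: three areas with benchmark population and
# active-user counts (not taken from simulated_coverage).
population <- c(a = 500, b = 300, c = 200)
user_count <- c(a = 80, b = 15, c = 5)
epsilon <- 1e-8

# Assumption: the smoothing constant is added to the raw counts,
# and shares are computed afterwards.
p <- (population + epsilon) / sum(population + epsilon)  # population shares
q <- (user_count + epsilon) / sum(user_count + epsilon)  # active-user shares

# KL(population || active users); natural log is an assumption.
kl_contribution <- p * log(p / q)
kl <- sum(kl_contribution)

# Jensen-Shannon divergence, built from the midpoint distribution.
m <- (p + q) / 2
jsd_contribution <- 0.5 * p * log(p / m) + 0.5 * q * log(q / m)
jsd <- sum(jsd_contribution)

print(c(kl = kl, jsd = jsd))
```

KL is directional, so swapping `p` and `q` generally changes its value, whereas the JSD is unchanged by the swap and, with natural logs, never exceeds `log(2)`.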
Usage
measure_bias_distribution(
  coverage_df,
  area_col = "origin",
  population_col = "population",
  user_count_col = "user_count",
  epsilon = 1e-08,
  return_area_level = TRUE
)
Arguments
- coverage_df
A data frame with one row per area, containing an area identifier, benchmark population, and active-user count.
- area_col
Area identifier column. Default "origin".
- population_col
Population column. Default "population".
- user_count_col
Active-user count column. Default "user_count".
- epsilon
Small positive smoothing constant added before shares are computed. Default 1e-8.
- return_area_level
Logical; return one row per area in the output. Default TRUE.
Value
A list with:
- summary: one-row tibble with KL, JSD, totals, and share-difference summaries.
- area_level: area-level tibble with population/user shares and KL/JSD contributions, returned when return_area_level = TRUE.
Examples
data(simulated_coverage)
measure_bias_distribution(simulated_coverage)
#> $summary
#> # A tibble: 1 × 11
#> comparison reference_distribution comparison_distribut…¹ n_areas
#> <chr> <chr> <chr> <int>
#> 1 active_users_vs_populat… population active_users 328
#> # ℹ abbreviated name: ¹comparison_distribution
#> # ℹ 7 more variables: total_population <dbl>, total_user_count <dbl>,
#> # epsilon <dbl>, kl_population_user <dbl>, jsd_population_user <dbl>,
#> # mean_abs_share_difference <dbl>, max_abs_share_difference <dbl>
#>
#> $area_level
#> # A tibble: 328 × 12
#> area population user_count coverage_score coverage_bias bias
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Adur 4975 558 0.112 0.888 0.888
#> 2 Allerdale 6750 1005 0.149 0.851 0.851
#> 3 Amber Valley 9334 808 0.0866 0.913 0.913
#> 4 Arun 13750 1416 0.103 0.897 0.897
#> 5 Ashfield 9323 1280 0.137 0.863 0.863
#> 6 Ashford 12042 1287 0.107 0.893 0.893
#> 7 Babergh 7491 449 0.0599 0.940 0.940
#> 8 Barking and Dagenham 20930 1279 0.0611 0.939 0.939
#> 9 Barnet 45399 2723 0.0600 0.940 0.940
#> 10 Barnsley 17559 1349 0.0768 0.923 0.923
#> # ℹ 318 more rows
#> # ℹ 6 more variables: population_share <dbl>, user_share <dbl>,
#> # share_difference_user_minus_population <dbl>, midpoint_share <dbl>,
#> # kl_contribution <dbl>, jsd_contribution <dbl>
#>