There are three main requirements for the buckets:
- When these buckets are combined with our project groups, the resulting bins must be large enough to minimize re-identification risk (we don't plan to release raw answers, but this is an additional safeguard).
- According to @JAnstee_WMF, the number of users per bin should be roughly normally distributed across the bins.
- There should be bin boundaries at 30 and 600 edits to preserve comparability with last year's data.
There are currently two bucket proposals. One creates bins of relatively even size (e_binned_edits), which prioritizes the first criterion. The other creates bins whose sizes are roughly normally distributed (n_binned_edits), which prioritizes the second. Both include the boundaries at 30 and 600 edits and cover the same 52,229 editors in total.
E bins
| BIN | EDITORS |
| --- | --- |
| [10, 30) | 2792 |
| [30, 150) | 14299 |
| [150, 600) | 14578 |
| [600, 1350) | 6953 |
| [1350, 3800) | 6873 |
| [3800, 1100000) | 6734 |
N bins
| BIN | EDITORS |
| --- | --- |
| [10, 30) | 2792 |
| [30, 100) | 9971 |
| [100, 600) | 18906 |
| [600, 6000) | 16096 |
| [6000, 12000) | 2374 |
| [12000, 1100000) | 2090 |
Further comparison information is in this notebook.
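For anyone who wants to reproduce the binning outside the notebook, here is a minimal sketch of how an edit count could be assigned to a bin under either proposal. It is not the notebook's code: the names `E_EDGES`, `N_EDGES`, `bin_label`, and `count_bins` are hypothetical, and it assumes the intervals are half-open, [low, high), as written in the tables above.

```python
from bisect import bisect_right

# Bin edges copied from the two proposals above.
# Assumption: every interval is half-open, i.e. [low, high).
E_EDGES = [10, 30, 150, 600, 1350, 3800, 1_100_000]
N_EDGES = [10, 30, 100, 600, 6000, 12000, 1_100_000]


def bin_label(edit_count, edges):
    """Return the '[low, high)' label for edit_count, or None if it is out of range."""
    if edit_count < edges[0] or edit_count >= edges[-1]:
        return None
    # Index of the last edge that is <= edit_count.
    i = bisect_right(edges, edit_count) - 1
    return f"[{edges[i]}, {edges[i + 1]})"


def count_bins(edit_counts, edges):
    """Count how many values fall into each bin; out-of-range values are skipped."""
    counts = {f"[{lo}, {hi})": 0 for lo, hi in zip(edges, edges[1:])}
    for edit_count in edit_counts:
        label = bin_label(edit_count, edges)
        if label is not None:
            counts[label] += 1
    return counts
```

With these edges, a user with exactly 30 edits lands in [30, 150) under the E proposal and in [30, 100) under the N proposal, while a user with 29 edits lands in [10, 30) under both, so the boundary at 30 used last year is preserved either way.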