>>! In T151425#3503094, @Bawolff wrote:
> Hmm, CDB is perhaps not the best data structure for this; really we should use Bloom filters instead.
>
> For a "mere" 700 MB, we could have a Bloom filter with a 0.01% (1 in 10,000) false positive rate containing all 306 million passwords.
>
> More realistically, 100,000 passwords is 234 KB at a 0.01% false positive rate, 292 KB at 0.001%, and 351 KB at 0.0001% (1 in a million).
>
> I guess it's not really clear what an acceptable false positive rate is in this context, but 1 in a million certainly seems acceptable beyond any doubt. Possibly other structures like cuckoo filters could give even better trade-offs, but I don't know much about them.
>
> https://hur.st/bloomfilter?n=100000&p=0.0001
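
As a sanity check on the sizes quoted above, here is a rough sketch of the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function names `bloom_size` and `kib` are just mine for illustration, and I'm assuming the "kb" figures are KiB (1024 bytes), truncated.

```python
import math


def bloom_size(n, p):
    """Return (bits, hash_count) for n items at false positive rate p.

    Uses the textbook optimum: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2.
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k


def kib(bits):
    """Convert a size in bits to KiB (1024 bytes)."""
    return bits / 8 / 1024


# 100,000 passwords at 1 in 10,000, 1 in 100,000 and 1 in a million.
for p in (1e-4, 1e-5, 1e-6):
    m, k = bloom_size(100_000, p)
    print(f"p={p:g}: {int(kib(m))} KiB, {k} hash functions")

# The full 306 million password list at 1 in 10,000.
m, k = bloom_size(306_000_000, 1e-4)
print(f"306M at p=0.0001: {int(kib(m) / 1024)} MiB, {k} hash functions")
```

Run as-is, this should reproduce the 234 / 292 / 351 figures for 100,000 passwords and come out just under 700 MiB for the full list, so the numbers in the quote look right to me.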

A quick look finds numerous implementations on GitHub, some of which can be pulled in via Composer. A few have no stated licence though, which is annoying.

| name | licence | composer | packagist | serializable | reproducible build |
| ----- | ----- | ----- | ----- | ----- | ----- |
| [[https://github.com/mrspartak/php.bloom.filter|mrspartak/php.bloom.filter]] | [[https://github.com/mrspartak/php.bloom.filter/issues/9|Unknown]] | Y | [[https://github.com/mrspartak/php.bloom.filter/issues/10|N]] | Y (serialise whole object) | N |
| [[https://github.com/makinacorpus/php-bloom|makinacorpus/php-bloom]] | [[https://github.com/makinacorpus/php-bloom/issues/1|Unknown]] | Y | Y | Y | Y |
| [[https://github.com/pleonasm/bloom-filter|pleonasm/bloom-filter]] | BSD 2-clause | Y | Y | N (jsonSerialize not implemented) | |
| [[https://github.com/dsx724/php-bloom-filter|dsx724/php-bloom-filter]] | Apache License 2.0 | [[https://github.com/dsx724/php-bloom-filter/issues/6|N]] | [[https://github.com/dsx724/php-bloom-filter/issues/6|N]] | Unknown | |
| [[https://github.com/rocket-internet-berlin/RocketLabsBloomFilter|rocket-internet-berlin/RocketLabsBloomFilter]] | MIT | Y | Y | Y (can save to redis) | Y |
| [[https://github.com/maxwilms/bloom-filter|maxwilms/bloom-filter]] | MIT | Y | Y | N (?) | |
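
To make the last two columns concrete, here is a minimal sketch of what I'm looking for, written in Python rather than against any of the libraries above. It assumes "serializable" means the filter can be dumped to bytes and loaded back, and "reproducible build" means the same password list always produces byte-identical output (so no random seeds or per-process hash salts). The `BloomFilter` class, its byte layout, and the SHA-256 double-hashing scheme are all my own illustration, not the API or format of any listed package.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter with deterministic hashing (hypothetical layout)."""

    def __init__(self, num_bits, num_hashes, bits=None):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        size = (num_bits + 7) // 8
        self.bits = bytearray(bits) if bits is not None else bytearray(size)

    def _positions(self, item):
        # Double hashing: derive all k positions from one SHA-256 digest.
        # No seed or salt is involved, so positions are stable across runs.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )

    def to_bytes(self):
        # Assumed layout: 8-byte bit count, 2-byte hash count, then the bit array.
        header = self.num_bits.to_bytes(8, "big") + self.num_hashes.to_bytes(2, "big")
        return header + bytes(self.bits)

    @classmethod
    def from_bytes(cls, data):
        num_bits = int.from_bytes(data[:8], "big")
        num_hashes = int.from_bytes(data[8:10], "big")
        return cls(num_bits, num_hashes, data[10:])


def build(passwords, num_bits=1_917_012, num_hashes=13):
    """Build a filter sized for roughly 100,000 entries at 1 in 10,000."""
    bloom = BloomFilter(num_bits, num_hashes)
    for password in passwords:
        bloom.add(password)
    return bloom
```

With something like this, `build(words).to_bytes()` can be written to a file once and shipped, and anyone rebuilding from the same list should get the same bytes regardless of input order, since adding an item only ever sets bits.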