>>! In T151425#3503094, @Bawolff wrote:
> Hmm, the cdb thing is perhaps not the best data structure, really we should use bloom filters instead.
>
> For a "mere" 700 mb, we could have a bloom filter with a 0.01% (1 in 10,000) false positive rate containing all 306 million passwords.
>
> More realistically, 100,000 passwords is 234 kb at 0.01% false positive, 292 kb for 0.001%, 351 kb for 0.0001% (1 in a million).
>
> I guess it's not really clear what is an acceptable false positive rate in this context, but 1 in a million certainly seems acceptable beyond any doubt... Possibly other structures like Cuckoo filters could give even better trade-offs but I don't know much about them.
>
> https://hur.st/bloomfilter?n=100000&p=0.0001
A quick look finds numerous implementations on GitHub, some of which can be pulled in via Composer. A few have no license though, so that's annoying.
- https://github.com/mrspartak/php.bloom.filter
- https://github.com/makinacorpus/php-bloom
- https://github.com/pleonasm/bloom-filter
- https://github.com/dsx724/php-bloom-filter (no composer)
- https://github.com/rocket-internet-berlin/RocketLabsBloomFilter
- https://github.com/maxwilms/bloom-filter
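
For anyone who wants to sanity-check the sizes quoted above without the online calculator, here is a rough sketch of the standard sizing formula for an optimally tuned Bloom filter, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function names are made up for illustration, and it assumes the "kb" and "mb" figures in the quote are really KiB and MiB.

```python
import math


def bloom_bits(n: int, p: float) -> int:
    """Bits needed for n items at false positive rate p (optimal k assumed)."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def bloom_hashes(n: int, p: float) -> int:
    """Number of hash functions that minimises the false positive rate."""
    return max(1, round(bloom_bits(n, p) / n * math.log(2)))


def bloom_kib(n: int, p: float) -> float:
    """Filter size in KiB (1 KiB = 1024 bytes)."""
    return bloom_bits(n, p) / 8 / 1024


if __name__ == "__main__":
    # The three rates from the quoted comment: 0.01%, 0.001%, 0.0001%.
    for p in (1e-4, 1e-5, 1e-6):
        print(f"n=100000 p={p:g}: {bloom_kib(100_000, p):.1f} KiB, "
              f"k={bloom_hashes(100_000, p)}")
    # The full 306 million password list at 0.01%.
    print(f"n=306000000 p=0.0001: {bloom_kib(306_000_000, 1e-4) / 1024:.1f} MiB")
```

The results should land close to the 234 / 292 / 351 kb and roughly 700 mb figures in the quote. Any of the libraries listed above may size things a bit differently (rounding to whole bytes, a different number of hashes), so treat this as a ballpark check only.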