Implement bloom filter for popular password password lists
Closed, ResolvedPublic

Description

Hmm, the cdb thing is perhaps not the best data structure, really we should use bloom filters instead.

For a "mere" 700 mb, we could have a bloom filter with a 0.01% (1 in 10,000) false positive rate containing all 306 million passwords.

More realistically, 100,000 passwords is 234 kb at 0.01% false positive, 292 kb for 0.001%, 351 kb for 0.001% (1 in a million).

I guess its not really clear what is an acceptable false positive rate in this context, but 1 in a million certainly seems acceptable beyond any doubt... Possibly other structures like Cuckoo filters could give even better trade-offs but i don't know much about them.

https://hur.st/bloomfilter?n=100000&p=0.0001

A quick look finds numerous implementations on github, some of which are available to pull in via composer. A few aren't licensed though, so that's annoying

namelicencelast commitcomposerpackagistserializablereproducible buildfile sizemin php version
mrspartak/php.bloom.filterUnknown04/04/2015YNY (serialise whole object)N1.4M5.3
makinacorpus/php-bloomMIT30/11/2017YYYY236K5.6
pleonasm/bloom-filterBSD 2-clause23/10/2017YYYY240K5.4
dsx724/php-bloom-filterApache License 2.003/10/2014NNNN/A
rocket-internet-berlin/RocketLabsBloomFilterMIT18/05/2017YYY (can save to redis too)Y176K
maxwilms/bloom-filterMIT15/09/2015YYYY176K5.4
Reedy created this task.Sep 6 2017, 4:23 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 6 2017, 4:23 PM
Reedy updated the task description. (Show Details)Sep 6 2017, 4:29 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 5:57 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 6:35 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 7:05 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 7:10 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 7:15 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 8:28 PM
Reedy updated the task description. (Show Details)
Reedy updated the task description. (Show Details)Sep 12 2017, 9:24 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 9:54 PM
Reedy updated the task description. (Show Details)
Reedy updated the task description. (Show Details)Sep 12 2017, 10:02 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 10:04 PM

Looks like based on looking up 1 value, the difference between implementations is at best, minimal

Running PHP version 5.6.30 (x86_64) on Darwin 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64

Building MakinaCorpus bloom filter for 100000 passwords in 3.6 seconds
0 missing items from the bloom filter.
Testing MakinaCorpus bloom filter for 100000 passwords in 3.5 seconds
MakinaCorpus
   times: 100000
   total:   3.44ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.00ms
Current memory usage: 41.75 megabytes
   Peak memory usage: 42.75 megabytes


Building MaxWilms bloom filter for 100000 passwords in 1.3 seconds
0 missing items from the bloom filter.
Testing MaxWilms bloom filter for 100000 passwords in 1.3 seconds
MaxWilms
   times: 100000
   total:   1.24ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.00ms
Current memory usage: 42 megabytes
   Peak memory usage: 43 megabytes


Building MrSpartak bloom filter for 100000 passwords in 4.7 seconds
0 missing items from the bloom filter.
Testing MrSpartak bloom filter for 100000 passwords in 4.8 seconds
MrSpartak
   times: 100000
   total:   4.67ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.00ms
Current memory usage: 43.5 megabytes
   Peak memory usage: 44.5 megabytes


Building Pleonasm bloom filter for 100000 passwords in 2.7 seconds
0 missing items from the bloom filter.
Testing Pleonasm bloom filter for 100000 passwords in 2.5 seconds
Pleonasm
   times: 100000
   total:   2.40ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.00ms
Current memory usage: 43.5 megabytes
   Peak memory usage: 45 megabytes


Building RocketLabs bloom filter for 100000 passwords in 3.6 seconds
0 missing items from the bloom filter.
Testing RocketLabs bloom filter for 100000 passwords in 5.1 seconds
RocketLabs
   times: 100000
   total:   5.03ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.02ms
Current memory usage: 42.5 megabytes
   Peak memory usage: 45 megabytes


Building dsx724 bloom filter for 100000 passwords in 1.1 seconds
0 missing items from the bloom filter.
Testing dsx724 bloom filter for 100000 passwords in 1.7 seconds
dsx724
   times: 100000
   total:   1.59ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.00ms
Current memory usage: 43 megabytes
   Peak memory usage: 45 megabytes
Reedy updated the task description. (Show Details)Sep 12 2017, 10:06 PM

https://github.com/igoreus/bloomfilter is most popular on packagist and https://github.com/rocket-internet-berlin/RocketLabsBloomFilter is second

One seems be a fork of the other... But I'm not sure why there's two like this...

https://github.com/reedy/mw-password-bloom-filter/blob/master/output/RocketLabs.ser doesn't seem very git friendly though

Reedy updated the task description. (Show Details)Sep 12 2017, 10:50 PM
Reedy updated the task description. (Show Details)Sep 13 2017, 3:47 PM
Reedy updated the task description. (Show Details)Sep 19 2017, 11:19 AM
Reedy updated the task description. (Show Details)Oct 3 2017, 2:00 AM
Reedy updated the task description. (Show Details)Oct 3 2017, 2:03 AM
Reedy updated the task description. (Show Details)Oct 3 2017, 2:08 AM
Reedy updated the task description. (Show Details)
Reedy updated the task description. (Show Details)Oct 8 2017, 8:56 AM
Reedy updated the task description. (Show Details)Dec 21 2017, 12:49 AM
Reedy closed this task as Resolved.Feb 26 2018, 1:08 AM
Reedy claimed this task.

wikimedia/password-blacklist lives