Implement bloom filter for popular password password lists
Open, Needs TriagePublic

Description

Hmm, the cdb thing is perhaps not the best data structure, really we should use bloom filters instead.

For a "mere" 700 mb, we could have a bloom filter with a 0.01% (1 in 10,000) false positive rate containing all 306 million passwords.

More realistically, 100,000 passwords is 234 kb at 0.01% false positive, 292 kb for 0.001%, 351 kb for 0.001% (1 in a million).

I guess its not really clear what is an acceptable false positive rate in this context, but 1 in a million certainly seems acceptable beyond any doubt... Possibly other structures like Cuckoo filters could give even better trade-offs but i don't know much about them.

https://hur.st/bloomfilter?n=100000&p=0.0001

A quick look finds numerous implementations on github, some of which are available to pull in via composer. A few aren't licensed though, so that's annoying

namelicencelast commitcomposerpackagistserializablereproducible buildfile sizemin php version
mrspartak/php.bloom.filterUnknown04/04/2015YNY (serialise whole object)N1.4M5.3
makinacorpus/php-bloomMIT30/11/2017YYYY236K5.6
pleonasm/bloom-filterBSD 2-clause23/10/2017YYYY240K5.4
dsx724/php-bloom-filterApache License 2.003/10/2014NNNN/A
rocket-internet-berlin/RocketLabsBloomFilterMIT18/05/2017YYY (can save to redis too)Y176K
maxwilms/bloom-filterMIT15/09/2015YYYY176K5.4
Reedy created this task.Sep 6 2017, 4:23 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 6 2017, 4:23 PM
Reedy updated the task description. (Show Details)Sep 6 2017, 4:29 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 5:57 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 6:35 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 7:05 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 7:10 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 7:15 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 8:28 PM
Reedy updated the task description. (Show Details)
Reedy updated the task description. (Show Details)Sep 12 2017, 9:24 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 9:54 PM
Reedy updated the task description. (Show Details)
Reedy updated the task description. (Show Details)Sep 12 2017, 10:02 PM
Reedy updated the task description. (Show Details)Sep 12 2017, 10:04 PM

Looks like based on looking up 1 value, the difference between implementations is at best, minimal

Running PHP version 5.6.30 (x86_64) on Darwin 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64

Building MakinaCorpus bloom filter for 100000 passwords in 3.6 seconds
0 missing items from the bloom filter.
Testing MakinaCorpus bloom filter for 100000 passwords in 3.5 seconds
MakinaCorpus
   times: 100000
   total:   3.44ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.00ms
Current memory usage: 41.75 megabytes
   Peak memory usage: 42.75 megabytes


Building MaxWilms bloom filter for 100000 passwords in 1.3 seconds
0 missing items from the bloom filter.
Testing MaxWilms bloom filter for 100000 passwords in 1.3 seconds
MaxWilms
   times: 100000
   total:   1.24ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.00ms
Current memory usage: 42 megabytes
   Peak memory usage: 43 megabytes


Building MrSpartak bloom filter for 100000 passwords in 4.7 seconds
0 missing items from the bloom filter.
Testing MrSpartak bloom filter for 100000 passwords in 4.8 seconds
MrSpartak
   times: 100000
   total:   4.67ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.00ms
Current memory usage: 43.5 megabytes
   Peak memory usage: 44.5 megabytes


Building Pleonasm bloom filter for 100000 passwords in 2.7 seconds
0 missing items from the bloom filter.
Testing Pleonasm bloom filter for 100000 passwords in 2.5 seconds
Pleonasm
   times: 100000
   total:   2.40ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.00ms
Current memory usage: 43.5 megabytes
   Peak memory usage: 45 megabytes


Building RocketLabs bloom filter for 100000 passwords in 3.6 seconds
0 missing items from the bloom filter.
Testing RocketLabs bloom filter for 100000 passwords in 5.1 seconds
RocketLabs
   times: 100000
   total:   5.03ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.02ms
Current memory usage: 42.5 megabytes
   Peak memory usage: 45 megabytes


Building dsx724 bloom filter for 100000 passwords in 1.1 seconds
0 missing items from the bloom filter.
Testing dsx724 bloom filter for 100000 passwords in 1.7 seconds
dsx724
   times: 100000
   total:   1.59ms
     min:   0.00ms
  median:   0.00ms
    mean:   0.00ms
     max:   0.00ms
Current memory usage: 43 megabytes
   Peak memory usage: 45 megabytes
Reedy updated the task description. (Show Details)Sep 12 2017, 10:06 PM

https://github.com/igoreus/bloomfilter is most popular on packagist and https://github.com/rocket-internet-berlin/RocketLabsBloomFilter is second

One seems be a fork of the other... But I'm not sure why there's two like this...

https://github.com/reedy/mw-password-bloom-filter/blob/master/output/RocketLabs.ser doesn't seem very git friendly though

Reedy updated the task description. (Show Details)Sep 12 2017, 10:50 PM
Reedy updated the task description. (Show Details)Sep 13 2017, 3:47 PM
Reedy updated the task description. (Show Details)Sep 19 2017, 11:19 AM
Reedy updated the task description. (Show Details)Oct 3 2017, 2:00 AM
Reedy updated the task description. (Show Details)Oct 3 2017, 2:03 AM
Reedy updated the task description. (Show Details)Oct 3 2017, 2:08 AM
Reedy updated the task description. (Show Details)
Reedy updated the task description. (Show Details)Oct 8 2017, 8:56 AM
Reedy updated the task description. (Show Details)Dec 21 2017, 12:49 AM