
Service for checking the Pwned Passwords database
Open, LowPublic

Description

The Pwned Passwords database is probably the most comprehensive collection of known-bad passwords that are either simple enough to be included in password dictionaries, or have been stolen in the process of some site compromise and then published. The database consists of 500 million password hashes and takes up 9 GB, so it is not trivial to query effectively; it would be great if it was available to Wikimedia wikis as a service (using the word loosely; maybe just a MySQL database that MediaWiki can connect to, maybe something more dedicated).

Event Timeline


I spent a bit of time looking into this over the weekend, mostly seeing what other people have come up with.

For the generic problem of doing a prefix search over a ton of strings I found https://stackoverflow.com/questions/41664257/prefix-search-against-half-a-billion-strings which had some more resources.

I think we'd want to implement a local version of the v2 range search API: https://haveibeenpwned.com/API/v2#PwnedPasswords implementing the same k-anonymity model that they rely on. Other MW users can point the relevant code at the haveibeenpwned-hosted API, while we or others who care about it can run a locally hosted version.
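As a rough illustration (not code from any of the implementations discussed in this task), a client-side check against such a range endpoint could look like the following Python sketch; only the first 5 hex characters of the SHA-1 ever leave the client:

import hashlib
import urllib.request

def pwned_count(password, base_url="https://api.pwnedpasswords.com/range/"):
    """Query a range (k-anonymity) endpoint and return the breach count, or 0."""
    sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = sha1[:5], sha1[5:]
    # Only the 5-character prefix is sent; the comparison of the remaining
    # 35 characters happens locally against the returned list.
    with urllib.request.urlopen(base_url + prefix) as response:
        body = response.read().decode("utf-8")
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip() == suffix:
            return int(count)
    return 0

# pwned_count("password123") returns a large number; a locally hosted
# service would just use a different base_url.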

I wrote a simple HTTP API wrapper around sgrep at https://git.legoktm.com/legoktm/PawnedPasswords implementing the same API as the PwnedPasswords one. It's about 50 lines of Python, and its only framework dependency is Flask, which could probably be removed in favour of plain CGI if that matters.

Note that this operates on the uncompressed dump, which is ~22G.
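For context only (this is not the code in that repository), a wrapper of that shape could be roughly the following sketch, assuming Flask and assuming sgrep is invoked as `sgrep <key> <sorted-file>`; the real tool's options may differ, and the dump path is made up:

import subprocess
from flask import Flask, abort

app = Flask(__name__)
# Hypothetical path to the sorted, uncompressed dump (~22 GB).
DUMP = "/srv/pwned-passwords-ordered-by-hash.txt"

@app.route("/range/<prefix>")
def range_search(prefix):
    prefix = prefix.upper()
    if len(prefix) != 5 or any(c not in "0123456789ABCDEF" for c in prefix):
        abort(400)
    # Assumption: sgrep binary-searches the sorted file for lines starting
    # with the given key; adjust the invocation to the actual CLI.
    result = subprocess.run(["sgrep", prefix, DUMP], capture_output=True, text=True)
    # Return suffix:count pairs, mirroring the api.pwnedpasswords.com response.
    lines = [line[5:] for line in result.stdout.splitlines() if line.startswith(prefix)]
    return "\n".join(lines), 200, {"Content-Type": "text/plain"}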

How much do we want to engineer this? Is this a suitable solution for the task?

According to Logstash there are about 10 logins and account creations per second. (Grafana seems to say about 30, I never could figure out what's the correct way to read it though so I'd trust Logstash more.) So maybe put it on a production-like server and see how well it takes that kind of load, and what the response time is?

I wrote a simple HTTP API wrapper around sgrep at https://git.legoktm.com/legoktm/PawnedPasswords implementing the same API as the PwnedPasswords one. It's about 50 lines of Python, and its only framework dependency is Flask, which could probably be removed in favour of plain CGI if that matters.

Flask is perfectly ok; however

Note that this operates on the uncompressed dump, which is ~22G.

Deploying 22 GB of text to the production cluster is not something I'm enthusiastic about. And given you use sgrep, you clearly expect access to local storage.

How much do we want to engineer this? Is this a suitable solution for the task?

More or less, yes, minus the file access part. We need a better solution if we want to run this in production.

Uh, just stick it into MySQL and be done with it? :P

Uh, just stick it into MySQL and be done with it? :P

Well, that won't allow using sgrep as the PoC @Legoktm created does, but sure, that could be more than ok.

Sgrep is just binary search, databases have indexes for that.

If we have the data locally (in our prod env), perhaps having a prefix-search-tree-based solution would be the way to go.

What's the benefit of having a service for this? These are just static files.

Given the recent events, I have recently been thinking of implementing such checking myself, but had considered just doing that in PHP with a file_get_contents(). If the files are available on the node, it's just an mmap. Doing the binary search in PHP won't be too pretty, but there's no need for the service. And the code is easily available for third-party installs, too.

What's the benefit of having a service for this? These are just static files.

Given the recent events, I have recently been thinking of implementing such checking myself, but had considered just doing that in PHP with a file_get_contents(). If the files are available on the node, it's just an mmap. Doing the binary search in PHP won't be too pretty, but there's no need for the service. And the code is easily available for third-party installs, too.

Well, I see several reasons why having a service is a better idea in general; even leaving that aside, one of the best properties MediaWiki has for horizontal scalability is that *it doesn't need to access any local filesystem*.

Also, consider this: if you had to read a 22 GB binary file from PHP with file_get_contents, that means using more than 22 GB of memory space *for that request*. Loading the file in RAM (so inside APC) would also be impractical, as we'd waste 22 * 221 GB of precious RAM.

I am ok with importing this data into a database or any other data store that makes it easy to search, and then running the queries from MediaWiki directly. I am even more ok with us writing a simple service that can be expanded to do more verification and can also be used by things other than MediaWiki.

Sgrep is just binary search, databases have indexes for that.

Thanks @MaxSem for explaining that to me! I wasn't aware ;)

I was just saying we need some more work than @Legoktm's simple proof of concept.

Sorry, I was thinking the download link was providing split files like those used in the k-anonymization API (e.g. https://api.pwnedpasswords.com/range/7110e), not a big 22GB file but 16⁵ files of about 22KB each.

Also, I hadn't taken into account that in v2 they also include the number of times each password has appeared.

I will try to prepare a PoC myself.

Ok, I have been preparing the mentioned PoC. Should I directly request a new extension repository for it?
As for the associated storage requirements, they easily go down to 9GB. We should decide if we prefer using 256 files of 37M, or 64K files of ~140K each.
(we could as well work with 1048576 text files of ~18K, but that would go up to 20GB!)
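For the record, the rough arithmetic behind figures of that kind (assuming ~517 million hashes, counts discarded, binary storage, and the per-file hash prefix dropped from every record; my own back-of-the-envelope numbers, which land in the same ballpark as those above):

N = 517_000_000  # approximate number of SHA-1 hashes in the dump

# 256 files keyed by the first byte: 19 bytes left per record.
total_256 = N * 19                # ~9.8 GB in total
per_file_256 = total_256 / 256    # ~38 MB per file ("256 files of 37M")

# 64K files keyed by the first two bytes: 18 bytes left per record.
total_64k = N * 18                # ~9.3 GB in total
per_file_64k = total_64k / 65536  # ~142 KB per file ("64K files of ~140K")

print(per_file_256 / 1e6, per_file_64k / 1e3)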

Ok, I have been preparing the mentioned PoC. Should I directly request a new extension repository for it?
As for the associated storage requirements, they easily go down to 9GB. We should decide if we prefer using 256 files of 37M, or 64K files of ~140K each.
(we could as well work with 1048576 text files of ~18K, but that would go up to 20GB!)

Sorry, maybe I wasn't clear enough.

While doing this as an external service is just my preference, not using local storage is a hard requirement. I still think mysql could be a good place to store the information and query it, but I'm agnostic regarding the type of datastore we want to use as long as it outperforms an RDBMS.

We are not going to deploy another multi-GB file hierarchy to all machines that might need to run the service.

To give an example, small amounts of local storage are ok (think of GeoIP data), but I don't think this qualifies as small.

While doing this as an external service is just my preference, not using local storage is a hard requirement.

Is that a hard requirement for long-term setups, or a hard requirement period? Because we have an ongoing attack that is wasting man-hours, causing significant anguish to the community, and even got us featured in the tech press, and this is one of the better shots to mitigate it. So this might not be the best time for non-critical architectural concerns.

I still think mysql could be a good place to store the information and query it

How quickly can we have a table set up in wikishared and load the 22G of data into it? Code-wise using the DB would be a very simple solution, but in my experience anything involving schema changes tends to take quite a while.

While doing this as an external service is just my preference, not using local storage is a hard requirement.

Is that a hard requirement for long-term setups, or a hard requirement period? Because we have an ongoing attack that is wasting man-hours, causing significant anguish to the community, and even got us featured in the tech press, and this is one of the better shots to mitigate it. So this might not be the best time for non-critical architectural concerns.

In an emergency we can do whatever we need to, but I don't think it's even thinkable in the short term to somehow rsync 22 GB more to all appservers and verify that any reimage/new server is in sync.

Also, it completely escapes me how copying the files to local storage would solve the problem faster than loading them into *a* datastore.

I still think mysql could be a good place to store the information and query it

How quickly can we have a table set up in wikishared and load the 22G of data into it? Code-wise using the DB would be a very simple solution, but in my experience anything involving schema changes tends to take quite a while.

I think these data should sit in a separate database instance, not one of the main ones.

Code-wise using the DB would be a very simple solution, but in my experience anything involving schema changes tends to take quite a while.

Creating new tables is an exception; that can be done about as fast as you can get approval. You'd still want to run it by the DBAs to make sure the schema makes sense.

Straw proposal:

We need a database table for a new service. It doesn't depend on any other tables, and is a single instance (not per wiki). It would look something like

CREATE TABLE pwned_passwords (
    -- First 20 bits of a SHA1 hash, encoded as an integer 0-1048575.
    -- (external compatibility requires 20 rather than 16 or 32)
    pp_prefix INTEGER NOT NULL,

    -- Remaining 140 bits of the SHA1 hash, encoded as "base-256".
    pp_suffix BINARY(18) NOT NULL,

    -- Count
    pp_count INTEGER NOT NULL,

    PRIMARY KEY (pp_prefix, pp_suffix)
);

As an alternative to the multi-column PK, we could add an auto-increment key and have an index just on pp_prefix.
The table will have ??? rows. Someone supply that number, my random guess is around 500 million.

Queries from the service would look like this:

SELECT pp_suffix, pp_count FROM pwned_passwords WHERE pp_prefix = ?;

The service will always want all rows for a single prefix and will never filter by the other fields. I don't know whether the service would do the query with ORDER BY or would sort client-side. Someone answer that.

The service will do no inserts, updates, or deletions. Bulk data updates may be needed from time to time, which would most likely involve updates to pp_count of existing rows and insertions of new rows.
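For illustration, this is roughly how a service could split an incoming hex SHA-1 into the proposed columns (a Python sketch; the trailing zero-nibble padding is an assumption, since 140 bits don't fill BINARY(18) exactly):

import hashlib

def split_sha1(password):
    """Split SHA-1(password) into (pp_prefix, pp_suffix) per the straw schema."""
    sha1_hex = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    # First 5 hex characters = first 20 bits, an integer in 0..1048575.
    pp_prefix = int(sha1_hex[:5], 16)
    # Remaining 35 hex characters = 140 bits; pad with one zero nibble so the
    # value fits the 18-byte column (the padding convention is assumed here).
    pp_suffix = bytes.fromhex(sha1_hex[5:] + "0")
    return pp_prefix, pp_suffix

# The service would then run the SELECT above with pp_prefix as the bound
# parameter and compare the returned pp_suffix values locally.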

Just a note: while it's taken 5 years for that number of records to go from 154 million to 517 million, some of those increases are 80 million records in a day; let's plan for that. (See https://www.troyhunt.com/ive-just-added-2844-new-data-breaches-with-80m-records-to-have-i-been-pwned/ for more.)

I wasn't aware of this task, but I contacted the Security team a few months ago with more or less the same idea. Hence here are my a-bit-more-than-2 cents:

Internal custom API

  • The backend of this shouldn't be something related to MediaWiki but a standalone service usable by any other tool inside the WMF Prod infrastructure, so that we could reuse it. The language to use to write it might depend on the data storage chosen (see below).
  • It should have a very basic API:
    • GET /$SHA1 => 200 if found, 404 if not found, in both cases empty body response (or optionally the $COUNT provided in the latest version of the files on 200, TBD); see the sketch after this list.
  • I personally deem it secure for internal services to ask directly for the SHA1 of the password over HTTPS, without the added complexity of k-anonymity. The SHA1 should not be persisted in the access logs, at most a prefix of it. Of course the k-anonymity approach is still an alternative if the above is not deemed secure enough, but it might affect the choice of the data storage.
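Client-side, the check against such an endpoint reduces to a status-code test; a minimal Python sketch, with a made-up internal service URL:

import hashlib
import urllib.error
import urllib.request

def is_pwned(password, base_url="https://pwned-passwords.svc.example.internal/"):
    """Return True if the (hypothetical) internal service knows this SHA-1."""
    sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    try:
        with urllib.request.urlopen(base_url + sha1) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return False
        raise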

MediaWiki side

Given that MediaWiki should work both inside our infrastructure and in any other installation, I think it should implement two possible backends to choose from via configuration:

  1. Our custom backend described above (to be used in prod)
  2. The HIBP public API using the K-Anonymity model (to be used by those who don't want to install and maintain the internal service):

I'm not entering into details on the UI side of it, but ideally it should enforce that the password used is not in the file for new registrations and password changes. While it should probably just suggest a password change for existing users at login time.
For suggestions on integrations practices see also https://www.troyhunt.com/introducing-306-million-freely-downloadable-pwned-passwords/
To be decided: should MediaWiki check it for all logins, or "store" the result of the check along with the version of the HIBP file, given that both users' passwords and the HIBP file don't change often? But then how would we check the current version used by the public HIBP API without making a request each time?

Data storage

I didn't have time to spend on this yet so I don't have benchmarks, but I think we should aim for having the data in memory as much as possible to achieve good performance, no matter the choice of the data storage.
The natural and quicker solution to me still seems to be MySQL; we should aim to have the primary key index fit in memory. If even more performance is needed we could also consider the memcached plugin for InnoDB ( https://dev.mysql.com/doc/refman/5.6/en/innodb-memcached.html ), but that has other complications (not available in MariaDB, lack of ACLs, etc.)
Given the particular nature of the data (static + periodic total reimport once or twice a year) I think it would be better to not have any replication at the MySQL layer and just have one (or two) servers per datacenter or, if re-using an existing cluster, to evaluate adding this new database to the replication ignore list and importing the data on all hosts in the cluster without writing to the binlog.

The typical workflow could be something like:

  • periodically monitor the latest version of the file available
  • download the file from HIBP; they strongly suggest using the torrent, but there is also a standard download link sponsored by CloudFlare
  • mangle the data so that it can be more easily imported into the DB (it depends on the solution chosen), in a centralized place; see the sketch after this list
  • import the data on all hosts into a table with the version of the data in its name, without touching the existing one used by the production service
  • change the configuration of the API service to the new version and deploy it (testing first ofc ;) )
  • once all is good, after a few days, drop the old table
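As one possible shape for the "mangle" step, assuming the table layout proposed earlier in this task: convert the dump's <hex SHA-1>:<count> lines into a tab-separated file that LOAD DATA INFILE can ingest (a sketch; the file name and the suffix padding are assumptions):

import sys

def dump_to_tsv(dump_path="pwned-passwords-sha1-ordered-by-hash.txt"):
    """Turn '<40-hex-SHA1>:<count>' lines into 'prefix<TAB>hex_suffix<TAB>count'."""
    with open(dump_path, "r", encoding="ascii") as dump:
        for line in dump:
            sha1_hex, _, count = line.strip().partition(":")
            prefix = int(sha1_hex[:5], 16)      # first 20 bits as an integer
            suffix_hex = sha1_hex[5:] + "0"     # padded to 18 bytes (assumed)
            sys.stdout.write("%d\t%s\t%s\n" % (prefix, suffix_hex, count))

if __name__ == "__main__":
    dump_to_tsv()

On the MySQL side the hex suffix could then be converted during the import, e.g. with SET pp_suffix = UNHEX(@suffix) in the LOAD DATA statement.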

Given that MediaWiki should work both inside our infrastructure and in any other installation, I think it should implement two possible backends to choose from via configuration:

Consideration should be given to whether having to maintain two backends is worth whatever savings come from not implementing the local service as being compatible with api.pwnedpasswords.com.

Note that any other user that intends to work for third parties might have to implement both backends as well, or else have a dependency on our bespoke pwned passwords service.

I'm not entering into details on the UI side of it, but ideally it should enforce that the password used is not in the file for new registrations and password changes. While it should probably just suggest a password change for existing users at login time.

The UI in MediaWiki already exists as part of AuthManager. This would most likely plug in by adding a check to $wgPasswordPolicy.

but I think we should aim for having the data in memory as much as possible to achieve good performance, no matter the choice of the data storage

22GB of data in memory?

but I think we should aim for having the data in memory as much as possible to achieve good performance, no matter the choice of the data storage

22GB of data in memory?

The 22GB number is the uncompressed version of an ASCII text file that looks something like the following:
000000005AD76BD555C1D6D771DE417A4B87E4B4:4
00000000A8DAE4228F821FB418F59826079BF368:2
00000000DD7F2A1C68A35673713783CA390C9E93:630

<hex SHA-1>:<popularity>

Not an ideal format for storing compactly in memory.

Some napkin math:
There are approx ~500 million SHA1s in there. Let's say we don't care about popularity for now. Even if you don't do any tricks (e.g. sharing prefixes), 20 bytes (a binary SHA1) * 500 million =~ 10 gigabytes, as a flat table in memory you could just binary-search. That's not really that much, IMO, and there's plenty of options for shrinking it further.
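To make that concrete, here is what a binary search over such a flat table could look like (a Python sketch using mmap; the preprocessed file of sorted raw 20-byte hashes is hypothetical):

import hashlib
import mmap

RECORD = 20  # one raw SHA-1 per fixed-width record

def contains(table, digest):
    """Binary-search a memory-mapped table of sorted 20-byte hashes."""
    lo, hi = 0, len(table) // RECORD
    while lo < hi:
        mid = (lo + hi) // 2
        record = table[mid * RECORD:(mid + 1) * RECORD]
        if record == digest:
            return True
        if record < digest:
            lo = mid + 1
        else:
            hi = mid
    return False

with open("pwned-sha1s.bin", "rb") as f:   # hypothetical preprocessed file
    table = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(contains(table, hashlib.sha1(b"hunter2").digest()))
    # ~log2(500 million) = about 29 record reads per lookup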

Consideration should be given to whether having to maintain two backends is worth whatever savings come from not implementing the local service as being compatible with api.pwnedpasswords.com.

Sure, given the simplicity of the direct API call it didn't seem like much of an overhead to me, and on the other hand it should allow optimizing things a bit (the client just gets a boolean response, no additional logic on the client side, much less network traffic between client and API, etc...). That said, I'm open to both options.

but I think we should aim for having the data in memory as much as possible to achieve good performance, no matter the choice of the data storage

22GB of data in memory?

We have a lot of servers with 64+ GB (up to much more ;) ), so yeah I don't see it as a problem.

@Joe: That's not a problem; not using local storage is actually easier, since it will simply fetch everything. I still think a cache should be available for those who want it. In my tests, fetching from the internet makes it ~1.4s slower, but that should be no problem within the same DC.

@Volans: That's just 8.8 GB, discarding counts and storing the hashes in binary. That means not using the k-anonymity files as the upstream version, but the size gains are worth it IMHO (and no, I don't want to make the code deal with nibbles!).

The dump is actually static, maybe next year Troy will release a V3 if there are significant changes, but it's not a source that really needs to be automatically loaded when there is a new release.

I have requested a repository for uploading the parts that I made. It still needs to be registered as a MediaWiki extension and a $wgPasswordPolicy provider, but the basics of checking whether a password is in a dump do work. :)

  • It should have a very basic API:
    • GET /$SHA1 => 200 if found, 404 if not found, in both cases empty body response (or optionally the $COUNT provided in the latest version of the files on 200, TBD).

Even HEAD /{sha1} should be enough for our needs.

  1. Our custom backend described above (to be used in prod)
  2. The HIBP public API using the K-Anonymity model (to be used by those who don't want to install and maintain the internal service):

I think we can get away with only one interface/comm class. The key parts in both cases are:

  • the input needed to make a request comprises $url, $sha1 and $prefix (a boolean: whether to send just a prefix or the whole SHA1)
  • if $prefix == true, check the response body for the specific hash; otherwise look at the return code

Even if we do make two implementations, the differences are so minimal that I don't think special effort is needed to maintain one vs. two.
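Sketched in Python for brevity (the real code would be PHP, and the class and parameter names here are made up), a single class covering both modes could look like this:

import hashlib
import urllib.error
import urllib.request

class PwnedLookup:
    """One client for both backends: k-anonymity range or full-hash lookup."""

    def __init__(self, url, prefix=True):
        self.url = url        # range endpoint or full-hash endpoint
        self.prefix = prefix  # True: send a 5-char prefix; False: the full SHA-1

    def is_pwned(self, password):
        sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
        try:
            with urllib.request.urlopen(self.url + (sha1[:5] if self.prefix else sha1)) as r:
                body = r.read().decode("utf-8")
                status = r.status
        except urllib.error.HTTPError as error:
            if error.code == 404:
                return False   # full-hash mode: not found
            raise
        if self.prefix:
            # Range mode: search the response body for our own suffix.
            return any(line.split(":")[0] == sha1[5:] for line in body.splitlines())
        return status == 200   # full-hash mode: the status code is the answer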

I didn't have time to spend on this yet so I don't have benchmarks, but I think we should aim for having the data in memory as much as possible to achieve good performance, no matter the choice of the data storage.
The natural and quicker solution to me still seems to be MySQL; we should aim to have the primary key index fit in memory. If even more performance is needed we could also consider the memcached plugin for InnoDB ( https://dev.mysql.com/doc/refman/5.6/en/innodb-memcached.html ), but that has other complications (not available in MariaDB, lack of ACLs, etc.)
Given the particular nature of the data (static + periodic total reimport once or twice a year) I think it would be better to not have any replication at the MySQL layer and just have one (or two) servers per datacenter or, if re-using an existing cluster, to evaluate adding this new database to the replication ignore list and importing the data on all hosts in the cluster without writing to the binlog.

I agree with @Anomie's suggestion about the schema in T189641#4785967 with one slight modification: I'd put the whole SHA1 as the second field as this would allow for a more performant look-up in the service itself as it could simply ask for the (prefix, sha1) tuple.

However, I do think we should think about a multi-DC-capable solution in this case. We don't want DC switches, outages and other concerns to influence, compromise or complicate the security aspects of this mechanism.

However, I do think we should think about a multi-DC-capable solution in this case. We don't want DC switches, outages and other concerns to influence, compromise or complicate the security aspects of this mechanism.

It's a completely stateless service, so it seems to me very easy to make it replicated and active/active.

Change 476870 had a related patch set uploaded (by Platonides; owner: Platonides):
[mediawiki/extensions/PwnedPasswords@master] Initial files

https://gerrit.wikimedia.org/r/476870

Change 476870 merged by MarcoAurelio:
[mediawiki/extensions/PwnedPasswords@master] Initial files

https://gerrit.wikimedia.org/r/476870

I agree with @Anomie's suggestion about the schema in T189641#4785967 with one slight modification: I'd put the whole SHA1 as the second field as this would allow for a more performant look-up in the service itself as it could simply ask for the (prefix, sha1) tuple.

If we're going to go with the non-K-anonymized local service, there's probably no point in having pp_prefix at all.

But no matter what, there's probably no point in having both pp_prefix and the full 160 bit hash in pp_suffix. All you'd be saving is allowing the service to do something like hex2bin( $sha1 ) instead of hex2bin( substr( $sha1, 5 ) ), at the cost of having to store more data (2 bytes × 500 million rows, plus probably the same in the index on the field).

@Platonides @MarcoAurelio I see you added a reference to this task in the extension you just created.

We were collectively pretty clear on the fact that:

  • a majority of people here, including those responsible for the production infrastructure, think having a standalone service is a better option
  • we definitely do not want to access files from local storage, but rather from a database.

It doesn't seem to me that the extension in its current state meets either of those requirements, so I don't see it as an immediate, practical solution to our current problem; what are your plans?

@Joe Did you actually read the code?

(sure, you will want to change line 11 to set cache to false, but I hadn't time before your message to mention here what I did)

We can discuss the merits of the code I wrote a few days ago, even scrap it if it's not useful, use it as the basis of a "standalone service to grab all password hashes", or simply keep it as an available solution for third-party wikis.

By the way, the password files do constitute a database; it's just not a relational one.

(starting again to present the code)

I have uploaded to https://gerrit.wikimedia.org/g/mediawiki/extensions/PwnedPasswords/ the code I mentioned.

It is currently little more than a proof of concept, but provides a useful basis for solving the stated problem. Some of the current shortcomings:

  • It isn't a real MediaWiki extension yet, nor does it hook into the rest of MediaWiki
  • Configuration is hardcoded in the constructor
  • It provides a way of checking passwords, but doesn't hook in as a provider that can be used at login
  • It should accept multiple URLs so it doesn't depend on a single server (it fails open, though). Not sure if it is worth putting an LB in front of that.

The current plans were just to present it here and gather what other people think about this approach.

@Platonides @MarcoAurelio I see you added a reference to this task in the extension you just created.

We were collectively pretty clear on the fact that:

  • a majority of people here, including those responsible for the production infrastructure, think having a standalone service is a better option
  • we definitely do not want to access files from local storage, but rather from a database.

It doesn't seem to me that the extension in its current state meets either of those requirements, so I don't see it as an immediate, practical solution to our current problem; what are your plans?

I am not responsible for that extension, nor did I create the repository. I just merged the patch because it had a dependent patch that was blocked by it, since it modified the code of one of the files (and there was no CI set up able to merge them even with a +2). Considering that these are initial files (and thus, IMHO, obviously cannot be expected to be a fully functional and working extension), I did so as I think that's what we usually do (publish what we've got first, then tweak the codebase). I have since set up CI for that repo and tested that it worked, which it did. That ended my involvement with those patches. Apologies if I did something wrong. Regards.

The dump is actually static, maybe next year Troy will release a V3 if there are significant changes, but it's not a source that really needs to be automatically loaded when there is a new release.

FWIW the API version and the dump version are unrelated, and the dump is already at V3 (was released in July). But indeed it does not change terribly often.

I still think that given we have a working implementation from @Legoktm (thanks!), and there is an ongoing attack, we should get that deployed ASAP (putting it on a single server suffices, no need to overdo it - maybe on some of the ORES boxes, since those already involve Python and huge files), and we can discuss a better setup later. Perfect is very much the enemy of the good here.

Although personally I don't see the big difference between a local file and a database (which is ultimately also a file, just somewhere else). The dump changes about twice a year, that can be handled with a deployment-like process (again, ORES does git-lfs, that would probably be appropriate here), no need for any sophisticated sync process. A DB index probably performs better than plain binary search, but the difference might not be relevant. Remember, this serves logins, not pageviews or anything like that; we don't have too many of those.

Adding @JBennett as I see he raised concern with the haveibeenpwned work in related https://phabricator.wikimedia.org/T210192#4786955

I think that was meant for using it for email lookup, not passwords? If your password hash is in the dump, the applicability of that is pretty clear.

We were collectively pretty clear on the fact that:

  • a majority of people here, including those responsible for the production infrastructure, think having a standalone service is a better option
  • we definitely do not want to access files from local storage, but rather from a database.

Do we want to put the integration point in MediaWiki core? IMO that only makes sense if we push for it to be default (e.g. you can enable the check as an installer option). I'm not sure we want that - telling a third-party service which of your accounts are easy to hack feels iffy, even if it's a small wiki and a reasonably well trusted service (T188868 has more detail on the risks). And options that require setting up a local service obviously should not be installer options.

So if we don't want the integration to live in core, we obviously need a MediaWiki extension, even if all it does is talk to the service. And once we have an extension it makes sense for it to support (besides haveibeenpwned.com and whatever service the WMF ends up with) a simple local option that works in most third-party setups. Since it has to be simple, that will be either a DB table or files (or both). I'm not sure about the relative benefits of those options; I can imagine exciting performance problems when you have one million files in a directory, but then this would be intended for small setups where performance is not really a concern since they have just a few logins per day.

Adding @JBennett as I see he raised concern with the haveibeenpwned work in related https://phabricator.wikimedia.org/T210192#4786955

I think that was meant for using it for email lookup, not passwords? If your password hash is in the dump, the applicability of that is pretty clear.

If I'm following all of this correctly, the proposal is to add a dump from HIBP and compare credentials against it. Essentially taking our 100K bad password list and increasing it by 490 million. Is that a fair summary?

A couple of questions/comments that I have that are probably covered elsewhere, but I'll ask as it's not clear to me from the dialogue so far.
*is this an extension or additional logic or both??
*My gut feeling is there might be some legal issues and maybe some weird attribution problems if we expose this for others to use or add it as an extension. I'll ask for their feedback here.
*What happens when a credential matches a bad cred here? Are they forced to reset, or what happens there? What about users w/out email?

I think before this goes wheels up it needs a bit of review.

Essentially taking our 100K bad password list and increasing it by 490 million. Is that a fair summary?

Sort of, although the implementation will be separate, in part because the bad password list has passwords and the haveibeenpwned list has hashes, but mainly due to the different scale. The pwned password list would involve a standalone service; for the bad password list there was no need for that. (OTOH I don't know how useful the bad password list would be if we have the pwned password one. Wasn't it mainly sourced from the same breaches?)

Also we don't have to use the two checks in the exact same way, we could be more lenient with one than the other for example.

is this an extension or additional logic or both??

I think either of those is an option (no need for both at the same time). I would go with an extension, per T189641#4789876.

My gut feeling is there might be some legal issues and maybe some weird attribution problems if we expose this for others to use or add it as an extension.

I don't see any point in exposing the service. It's a security risk (it could be a DDoS target for example) and haveibeenpwned.com already exists, providing this service for others would add very little value.

The extension would not include the password hash dump, if for no other reason than because we can't bundle 10GB of hashes with an extension, so users would have to get that for themselves (from haveibeenpwned.com, which is hosted by Microsoft IIRC). I don't think we can run into any legal issues just by publishing code that interfaces with that service / dump (which is a high-profile public service which no doubt passed Microsoft legal review). It wouldn't even necessarily be published by the WMF as such.

What happens when a credential matches a bad cred here? Are they forced to reset, or what happens there? What about users w/out email?

We have been talking about two different use cases (the first one long-term and not really related to current events, the second as an immediate mitigation):

  • Use the list in our standard password validity checks (refuse new account registrations with those passwords, nag users to change the password when they log in with it). This does not involve emails; we trust that the user who is logging in is legitimate.
  • Use the list in the ongoing attack as a temporary mitigation. This would probably take the form of an email-based second factor when someone with a privileged or extended-confirmed account logs in with a password that is on the list. (Possibly only when they fail the LoginNotify "login from an IP this account has not used recently" check, if the code for that's easy to reuse.) If the account does not have an email address, they will be locked out for the time being and can contact someone for account recovery if they can prove their identity, same as with lost passwords, or wait out the attack. (Or I guess we can just let logins with no email succeed and hope for the best.)

Do we want to put the integration point in MediaWiki core? IMO that only makes sense if we push for it to be default (e.g. you can enable the check as an installer option).

It could easily enough be in core without being enabled by default.

Whether in core or an extension, the most sensible integration point with the rest of MediaWiki would be $wgPasswordPolicy. We might enable it by default (maybe only for some groups) pointing at api.pwnedpasswords.com, or we might have it in $wgPasswordPolicy['checks'] without it being enabled by default for any group (and in that case we might require the user specifically configure it), or we might have the user install an extension that adds to $wgPasswordPolicy from its extension.json.

Even in core, a configuration option to use a local copy of the download without having it be behind a service could be one of the configuration options, although maybe not a great one.

If I'm following all of this correctly, the proposal is to add a dump from HIBP and compare credentials against it. Essentially taking our 100K bad password list and increasing it by 490 million. Is that a fair summary?

That seems to be an accurate high-level summary. The implementation would, of course, differ.

*is this an extension or additional logic or both??

It could be done either way.

*My gut feeling is there might be some legal issues and maybe some weird attribution problems if we expose this for others to use or add it as an extension. I'll ask for their feedback here.

That strikes me as very unlikely, but I suppose it doesn't hurt to ask actual lawyers. We wouldn't distribute the dump ourselves, and I don't think anyone plans to make the proposed service accessible to the public either. And the dump is freely available from https://haveibeenpwned.com/Passwords (under a CC BY-4.0 license even!).

If the feature would have the ability to query api.pwnedpasswords.com directly, their acceptable use policy looks like it would be very easy to comply with (for any factors remotely under our control).

*What happens when a credential matches a bad cred here? Are they forced to reset, or what happens there? What about users w/out email?

If it's plugged in via $wgPasswordPolicy in the manner I think it should be,

  • On login with a matching password, they'd be shown an interstitial[1] prompting them to change their password. The interstitial would have a "skip" button,[2] in which case they'd be prompted again the next time they log in.
  • On account creation and on password change, any matching password would be rejected.

In all these cases the check provides the message to display to the user, so there will be no trouble about customizing the messaging to be specific to this check (unless you want the body of the message to differ between login/creation/change for some reason).

It doesn't matter here whether the user has provided an email address or not. I don't see any existing code to directly implement the idea suggested elsewhere about having it trigger an email-based 2FA, so that might need to be added if we would want that.

[1]: The interstitial would be similar to the one that's currently displayed to ask for a 2FA code. Or you could trigger it on a local testing wiki by setting your password then adjusting $wgPasswordPolicy so that password is no longer valid.
[2]: The code displaying that interstitial already has the ability to disallow skipping, but there's currently no way for a check in $wgPasswordPolicy to have it do so.

It could easily enough be in core without being enabled by default.

It could, sure. I just don't see the point of it being in core if it's not easy to enable. If it uses a DB table, we'll probably want a maintenance script to create it from the standard dump file. If it uses a million small files, likewise. Core has enough maintenance creep already. If you need to consider complex security trade-offs to decide whether to use the official API, we want to document those at a place that's more accessible and more open to participation than a doc comment in DefaultSettings.php.

Even in core, a configuration option to use a local copy of the download without having it be behind a service could be one of the configuration options, although maybe not a great one.

I don't think loading a 22GB file into PHP and doing substring searches on it is reasonable. It would have to be preprocessed into a tree of smaller files, or a bunch of CDB files, or something like that. Which is a lot of complexity and should not be in core IMO. If we want to make string password checks part of the default "security experience" (which is probably a good idea), IMO it's better to push for a bundled extension.

I don't see any existing code to directly implement the idea suggested elsewhere about having it trigger an email-based 2FA, so that might need to be added if we would want that.

EmailAuth is in production, just not enabled correctly currently.

I don't think loading a 22GB file into PHP and doing substring searches on it is reasonable. It would have to be preprocessed into a tree of smaller files, or a bunch of CDB files, or something like that. Which is a lot of complexity and should not be in core IMO. If we want to make string password checks part of the default "security experience" (which is probably a good idea), IMO it's better to push for a bundled extension.

If we want this to be part of the default "security experience", the default should be to use api.pwnedpasswords.com regardless of whether it's in core or an extension. Unless you're thinking the extension is going to include the 10GB or 20GB of preprocessed data?

I don't see much difference between having a maintenance script to preprocess the download in core versus having it in an extension.

I don't see any existing code to directly implement the idea suggested elsewhere about having it trigger an email-based 2FA, so that might need to be added if we would want that.

EmailAuth is in production, just not enabled correctly.

My point was that there's no way for a password policy check function to directly tell EmailAuth to activate. It looks like the check function would have to know about the 'EmailAuthRequireToken' hook and hook it when the check fails, which in any case seems like a bit of a weird code-flow.

What would make a bit more sense to me would be for there to be a 'do-extra-2FA' signal "defined" like the existing 'reset-pass' signal. We could then somehow propagate extra flags from the password policy checks (an associative array in $status->value?)[1] that AbstractPasswordPrimaryAuthenticationProvider would check to know whether to set that signal. Then various SecondaryAuthenticationProviders could check the signal and take appropriate action (including clearing of the signal). So OATHAuth could always run and clear the signal, while EmailAuth could run after OATHAuth and only run if the signal (or its hook) says to.

[1]: That would also be a way to let checks indicate that 'reset-pass' should be hard rather than soft.

If we want this to be part of the default "security experience", the default should be to use api.pwnedpasswords.com regardless of whether it's in core or an extension.

Sure. But using that involves non-trivial tradeoffs (you are sending your password hashes to a third party - k-anonymity offers very little defense if the service is malicious) which is why I have misgivings about making it an easily enabled core option, with the more secure alternative being less easy to enable (or even discover).

I don't see much difference between having a maintenance script to preprocess the download in core versus having it in an extension.

Core maintenance scripts tend to bitrot faster than extension maintenance scripts, IMO. Feature creep made the core maintenance script repo pretty much unmaintainable.

My point was that there's no way for a password policy check function to directly tell EmailAuth to activate. It looks like the check function would have to know about the 'EmailAuthRequireToken' hook and hook it when the check fails, which in any case seems like a bit of a weird code-flow.

The way I'd do it is to have the PwnedPasswords extension provide the password check as a service (with in-process caching), and invoke that in the EmailAuthRequireToken handler (which would be in CommonSettings.php, or the private equivalent - the main goal of EmailAuth was to provide the ability for fairly targeted intervention in case of an attack, so the triggering code would always live in site configuration).

What would make a bit more sense to me would be for there to be a 'do-extra-2FA' signal "defined" like the existing 'reset-pass' signal. We could then somehow propagate extra flags from the password policy checks (an associative array in $status->value?)[1] that AbstractPasswordPrimaryAuthenticationProvider would check to know whether to set that signal. Then various SecondaryAuthenticationProviders could check the signal and take appropriate action (including clearing of the signal). So OATHAuth could always run and clear the signal, while EmailAuth could run after OATHAuth and only run if the signal (or its hook) says to.

If we want to do email 2FA under normal circumstances, that would be the way to go, yes. The hook exists so you can easily change the conditions without having to deploy an update to the extension code.

Sure. But using that involves non-trivial tradeoffs (you are sending your password hashes to a third party - k-anonymity offers very little defense if the service is malicious) which is why I have misgivings about making it an easily enabled core option, with the more secure alternative being less easy to enable (or even discover).

Why "k-anonimity offers very little defense"?

Why "k-anonimity offers very little defense"?

It only means the attacker has to try k passwords instead of one. With k being a few hundred, that only makes a realistic difference if you use super strict rules for bad logins (e.g. lock out the user completely after some amount of tries, like banks do). I guess a wiki could use very aggressive bad password throttling specifically for login attempts with weak passwords (a few per day or month) and in that case it would make a difference, but not with the default setup.

Why "k-anonimity offers very little defense"?

It only means the attacker has to try k passwords instead of one. With k being a few hundred, that only makes a realistic difference if you use super strict rules for bad logins (e.g. lock out the user completely after some amount of tries, like banks do). I guess a wiki could use very aggressive bad password throttling specifically for login attempts with weak passwords (a few per day or month) and in that case it would make a difference, but not with the default setup.

But you're sending to this API only the first 5 chars of the SHA1; that returns ~500 SHA1 hashes matching that prefix, and the client checks whether the hash of the password under test is in that list. So the malicious service doesn't know if the check was positive or negative.

The malicious service at most knows that you tested a password whose SHA1 starts with those 5 known chars. It could, of course, work under the assumption that all the tested passwords are positive (but it doesn't know that), and start trying all the ~500 passwords on the website that made the check, but for which username? It doesn't have any info about those, so it would have to try them all.

Moreover, on account creation/password change, if the password is in the list it should be forbidden, so there is no risk there.
The only potential risk I see is at login for pre-existing users, if the password matches and the user is not forced to change it on the spot; in practice this is limited to smaller wikis (not too many users) for which the list of usernames is public or easily deducible.

In this very specific case a malicious service could start trying the ~500 passwords for all the usernames and at some point get lucky.
I have no idea how many MediaWiki installations fit those assumptions, but it is a threat that is easy to protect against:
when using the public HIBP API we could disallow skipping the alert that tells you to change your password, forcing the user to change it on the spot.

@Volans: I responded in T188868#4790640 to avoid going off track here.

"WMF bans every human-readable password". That's not quite true but it'll be close enough for practical purposes. A virtual riot by those who have that requirement imposed on them would be one of the correct responses to banning half a billion passwords with a view to extending that to billions in the future. Maybe nobody has thought to put that into plain language yet.

chasemp added a project: Security-Team.