Page MenuHomePhabricator

CU 2.0: Persist the form state
Open, Needs TriagePublic

Description

The form's state should be persisted. When the user changes the state of the form, it should log a new request. The user must provide a new reason to update the state.

It might be best to use the Post/Redirect/Get method.

Details

Related Gerrit Patches:

Event Timeline

dbarratt created this task.Dec 3 2019, 4:08 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 3 2019, 4:08 AM

Change 554229 had a related patch set uploaded (by Dbarratt; owner: Dbarratt):
[mediawiki/extensions/CheckUser@master] Persist the form state in the user's session

https://gerrit.wikimedia.org/r/554229

dbarratt claimed this task.Dec 3 2019, 4:26 AM
dbarratt updated the task description. (Show Details)Dec 3 2019, 4:35 AM
dbarratt renamed this task from CU 2.0: Persist the form state in the user's session to CU 2.0: Persist the form state.Dec 4 2019, 6:49 PM
dbarratt moved this task from Review to In Progress on the Anti-Harassment (The Letter Song) board.

Repeating what we discussed (briefly) in an RL meeting -- We need to come up with a way to make sure the pagination works with the same state, but I am not convinced that sessions are the best way to go.

Sessions have usually a big impact on the UX;

  • Persistence across tabs -- PHP sessions are stored per browser (not tab) so they persist across tabs. Is that okay?
  • Timeout -- They time out (DefaultSettings has $wgObjectCacheSessionExpiry = 3600;, which is 1hour, which is fairly short). If the user starts an investigation, goes away for an hour and come back, the session variables would by that time expire, and we lose that state. Should we retain it? Is this scenario a normal case or do users finish investigations quicker?

I am wondering if it won't be even more straight forward to "just" send all the data directly from the form each time we paginate. We can set up a hidden field or POST param noting whether logging is required based on the UI action...

This would make it potentially more complicated, but it might be easier to control? I'm not sure here, but I think we should at least go over the impact that sessions have on the UX and see if we need something else.

dbarratt updated the task description. (Show Details)Dec 4 2019, 8:09 PM
AronManning added a subscriber: AronManning.EditedDec 4 2019, 9:07 PM

Persistence across tabs -- PHP sessions are stored per browser (not tab) so they persist across tabs. Is that okay?

The current CU UI is commonly used on many tabs with different state in complex cases to follow user1 -> ip1 -> user2 -> ip2 -> user3 -> ip3 correlations. There can be more, different investigations parallel in one session, therefore storing the pagination state per session would limit the usability of the tool for complex cases.
The pagination state is a few form values only (usernames, CU log unique key, CU log timestamp, ordering, starting index, count), keeping it client-side is not an issue.

The usernames and the CU log timestamp are persisted in the CU log, however. The usernames and the query date range (90 days before the CU log timestamp) identifies the exact result set. If the cleanup task for the 90-day old entries runs, while the investigation is in progress, those entries would disappear upon a refresh or server-side pagination. This might cause a minor confusion for the CU, but this would be a rare occurrence required by the privacy policy.

  1. Imo the CU log entry should be created when the request is sent, only once.
  2. Sorting and pagination might run another query on the CU table, but it's the same CU investigation.
  3. The results should be associated with the CU log entry created by the investigation.

I am wondering if it won't be even more straight forward to "just" send all the data directly from the form each time we paginate. We can set up a hidden field or POST param noting whether logging is required based on the UI action...

I'm going to explore tokenizing/signing the request. It looks like we are already using a JWT library in production so I can't imagine why it would be an issue to reuse that.

It looks like other extensions use that library and sign the JWT with the $wgSecretKey so I think that should be fine.

It might split this logic into a separate service if it makes sense to do that....

I'm going to explore tokenizing/signing the request. It looks like we are already using a JWT library in production so I can't imagine why it would be an issue to reuse that.

To authorize a subsequent request (sorting/paging), checking the existence of a CU log entry (with the CU user, check timestamp, target users) should be necessary and sufficient.
Even if using a token, that should be valid only for the CU doing the initial request.

To authorize a subsequent request (sorting/paging), checking the existence of a CU log entry (with the CU user, check timestamp, target users) should be necessary and sufficient.

Well that's an interesting idea... feels like a bit of an "abuse" of a log (i.e. I think an application should function even if logging is unavailable)... but certainly not unprecedented.

Even if using a token, that should be valid only for the CU doing the initial request.

Yes. and the wiki you're making the request on.

AronManning added a comment.EditedDec 6 2019, 4:23 AM

To authorize a subsequent request (sorting/paging), checking the existence of a CU log entry (with the CU user, check timestamp, target users) should be necessary and sufficient.

Well that's an interesting idea... feels like a bit of an "abuse" of a log (i.e. I think an application should function even if logging is unavailable)... but certainly not unprecedented.

CU log entries and CU checks have a 1-1 relation. Calling it a "log" is arbitrary, reflecting the original use-case. Maybe calling it "CU checks" in the DB would be more appropriate while calling it "CU log" on the UI.
I see only one limiting factor with the current cu_log table: cul_user is an N-1 relation, that cannot accommodate for multiple users checked in one go. That will require a new table (schema change).

CU log entries and CU investigations have a 1-1 relation. Calling it a "log" is arbitrary, reflecting the original use-case. Maybe calling it "CU investigations" in the DB would be more appropriate while calling it "CU log" on the UI.
I see only one limiting factor with the current cu_log table: cul_user is an N-1 relation, that cannot accommodate for multiple users checked in one investigation. That will require a new table (schema change).

Naive question: Is it a problem if we add multiple rows per the same investigation to this log?

Technically, it seems to me that this is more or less what is already done, isn't it? If an investigation about a certain IP reveals multiple users, that then require
individual "attention" by the checker, they'll review each (relevant) username, yet the investigation itself is for a single purpose.

We're trying to prevent the *need* to have multiple tabs open as you have to do more and more in depth investigations, and instead give the checkuser the data cleanly. That might mean that in the same investigation, more users are exposed -- would it not be okay to log those separately, with the same comment/investigation?

(Am I misunderstanding the distinction you're making between investigation and log?)

I've used the token method (but feel free to change it if the log method makes more sense!)

AronManning added a comment.EditedDec 6 2019, 5:13 AM

CU log entries and CU investigations have a 1-1 relation. Calling it a "log" is arbitrary, reflecting the original use-case. Maybe calling it "CU investigations" in the DB would be more appropriate while calling it "CU log" on the UI.
I see only one limiting factor with the current cu_log table: cul_user is an N-1 relation, that cannot accommodate for multiple users checked in one investigation. That will require a new table (schema change).

Naive question: Is it a problem if we add multiple rows per the same investigation to this log?
Technically, it seems to me that this is more or less what is already done, isn't it? If an investigation about a certain IP reveals multiple users, that then require
individual "attention" by the checker, they'll review each (relevant) username, yet the investigation itself is for a single purpose.

I used the wrong word "investigation", I meant "check". An investigation might require more checks. I'm editing that comment to fix this.

We're trying to prevent the *need* to have multiple tabs open as you have to do more and more in depth investigations, and instead give the checkuser the data cleanly. That might mean that in the same investigation, more users are exposed -- would it not be okay to log those separately, with the same comment/investigation?
(Am I misunderstanding the distinction you're making between investigation and log?)

As in current practice, different checks on separate tabs are different cu_log entries.
My concern is whether one check (pressing the "Submit" button, switching to "Compare" tab) for multiple users should create one or multiple cu_log entries.

Mooeypoo added a comment.EditedDec 6 2019, 7:20 AM

My concern is whether one check (pressing the "Submit" button, switching to "Compare" tab) for multiple users should create one or multiple cu_log entries.

Sorry, I'm confused, let me try and clarify for my own mental image; why would going from "submit" and into any of the "sub" details ("compare" vs details, etc) be another check? Should it be logged on its own? Is the log representing the general investigation, or specific check, and how would we define a "check" for logging purposes? It sounds like pagination change is not a different check (right?) but is going to the 'compare' tab a different check?

I'm trying to understand what the expectation is from the log, and even more, I'm trying to identify the things that are done right now (with the current operation) because they're good and define the workflow, vs things that might be done with the current system out of the (very elaborate/multiple-tab-
oriented) way it works, but might actually not be necessary if we make the workflow more streamlined.

Does this make sense?

AronManning added a comment.EditedDec 6 2019, 2:42 PM

Sorry, I'm confused, let me try and clarify for my own mental image; why would going from "submit" and into any of the "sub" details ("compare" vs details, etc) be another check?

The "Submit" button brings up the "Preliminary check" tab. As I understand the purpose of that is to not run the actual check (thus no cu log entry created). The first query is run when the "Compare" or "Timeline" tab is opened. I think the UI workflow is a bit confusing, as there is no big button that actually runs the check, or I am misunderstanding the proposed workflow.

Should it be logged on its own? Is the log representing the general investigation, or specific check, and how would we define a "check" for logging purposes? It sounds like pagination change is not a different check (right?) but is going to the 'compare' tab a different check?

Only one log entry should be made (or one per checked user) when the first actual query is run on the cu_changes table. Following queries (filtering, sorting, pagination) should not make further cu_log entries, but check the existence of the initial entry or use a token to authorize the query as dbarratt implemented.

dbarratt removed dbarratt as the assignee of this task.Dec 6 2019, 10:24 PM
dmaza claimed this task.Thu, Dec 19, 9:11 PM
dmaza moved this task from Review to In Progress on the Anti-Harassment (The Letter Song) board.
dmaza added a comment.EditedThu, Jan 2, 8:07 PM

The "Submit" button brings up the "Preliminary check" tab. As I understand the purpose of that is to not run the actual check (thus no cu log entry created). The first query is run when the "Compare" or "Timeline" tab is opened. I think the UI workflow is a bit confusing, as there is no big button that actually runs the check, or I am misunderstanding the proposed workflow.

Last I heard hitting "Submit" on the "Preliminary check" tab is part of the check and thus a cu log entry will be created. Switching between tabs, paginating and filtering will not create a new log entry.
FWIW, logging is beyond the scope of this task.


@dbarratt
To summarize what has been discussed, these are our options so far.
1.- Store investigation params in session with a unique ID so to allow multiple investigations at the same time. This "ID" will be passed around via url across pages as a "token".
Session data would look something like below where abc, def are two different investigations happening at the same time.

[ 'abc' => [ 'users' => [ 'userA', 'userB' ], ... ] ], 
[ 'def' => [ 'users' => [ 'userC', 'userD' ], ... ] ],

2.- Create a token and encrypt it and pass it around as a url param. It was mentioned that this is not "private" enough since the request url is logged and the token will be available in other places. (Even when encrypted)
3.- Pass around the log id (read investigation record id) and hit the db on every page load 🙊. This db trip should be cached and shouldn't impact performance. Something to consider here is how to handle users accessing an existing record(Investigation) originally started by a different user (for example, if a CheckUser shares the investigation url with another CheckUser)
4.- Send everything (or the sensitive bits) via POST. This would involve modifying the pager and filters to post investigation params on every request.

2.- Create a token and encrypt it and pass it around as a url param. It was mentioned that this is not "private" enough since the request url is logged and the token will be available in other places. (Even when encrypted)

Why wouldn't it be private enough if it was encrypted? The only people that would be able to decrypt it would also have access to the production database and have the information that is being queried anyways (as well as the CheckUser log)

dmaza added a comment.Thu, Jan 2, 8:31 PM

2.- Create a token and encrypt it and pass it around as a url param. It was mentioned that this is not "private" enough since the request url is logged and the token will be available in other places. (Even when encrypted)

Why wouldn't it be private enough if it was encrypted? The only people that would be able to decrypt it would also have access to the production database and have the information that is being queried anyways (as well as the CheckUser log)

I agree with you but that's what I was told. Maybe @Mooeypoo knows someone who can "vet" it as "all good" ?

dmaza reassigned this task from dmaza to dbarratt.Thu, Jan 2, 9:00 PM
dmaza added a subscriber: dmaza.

Change 561890 had a related patch set uploaded (by Tchanders; owner: Tchanders):
[mediawiki/extensions/CheckUser@master] Override navigation links in Special:Investigate to POST a request

https://gerrit.wikimedia.org/r/561890

Mooeypoo added a subscriber: sbassett.EditedTue, Jan 7, 4:01 PM

Tagging Security-Team for advice here. I think @sbassett helped us before (you might be a bit more in the loop on this?) but wasn't sure who is best to tag from your team, so I am tagging the team in general.

Quick reiterated intro -- CheckUser is being redesigned with a more streamlined form and result view. We are trying to see how to consolidate the pagination; we'll need to send the server some information about the investigation + the page the user is on to get the new result query for every page. There are several options that we've been reviewing on how to store and/or send this information, with a big emphasis on security, and considering some comments we've received offline and casually.

Note: From the summary below, we think #2 is the best option. If having a URL parameter token that is encrypted is okay with security concerns, that is our best route forward.

That said, if #2 is not a good option, we want to make sure there are no specific security concerns with any of the options mentioned.

The summary of the options we're considering:

To summarize what has been discussed, these are our options so far.
1.- Store investigation params in session with a unique ID so to allow multiple investigations at the same time. This "ID" will be passed around via url across pages as a "token".
Session data would look something like below where abc, def are two different investigations happening at the same time.

[ 'abc' => [ 'users' => [ 'userA', 'userB' ], ... ] ], 
[ 'def' => [ 'users' => [ 'userC', 'userD' ], ... ] ],

2.- Create a token and encrypt it and pass it around as a url param. It was mentioned that this is not "private" enough since the request url is logged and the token will be available in other places. (Even when encrypted)
3.- Pass around the log id (read investigation record id) and hit the db on every page load 🙊. This db trip should be cached and shouldn't impact performance. Something to consider here is how to handle users accessing an existing record(Investigation) originally started by a different user (for example, if a CheckUser shares the investigation url with another CheckUser)
4.- Send everything (or the sensitive bits) via POST. This would involve modifying the pager and filters to post investigation params on every request.

sbassett moved this task from Incoming to Watching on the Security-Team board.

Change 563195 had a related patch set uploaded (by Tchanders; owner: Tchanders):
[mediawiki/extensions/CheckUser@master] [WIP alternative] Override navigation links in Special:Investigate to POST a request

https://gerrit.wikimedia.org/r/563195

sbassett added a comment.EditedThu, Jan 9, 11:21 PM

@Mooeypoo @dbarratt et al - a few initial security points/concerns for option 2:

  1. firebase/php-jwt seems like a mature, stable JWT lib, especially since WMF already deploys 5.0.0. And jwt.io even likes it. A minor nit might be pinning to a specific version within composer.json on the patch, since that tends to be a bit more secure and would match what would need to be in mediawiki/vendor.
  2. I noticed in the latest PS on the patch there was a TODO around setting an exp field in the payload. I'd strongly advise this be done and then the tokens properly invalidated after said expiry, otherwise I'd place a higher risk rating (medium+) for this option. I don't have enough experience with CU to know for certain, but it seems like there could be a theft/replay risk with folks who have both checkuser and checkuser-log rights and we'd want to minimize this attack surface as much as we can. I think we'd consider these to be very trusted users, but such assumptions aren't really part of a robust security model.
  3. It might be a good idea to programmatically check for the existence of the encryption algorithm as we do in core in addition to the $secret check within the TokenManager constructor.
  4. I'm not certain I love the way the IV seed is being created in the patch, though it's probably good enough for this application.
  1. firebase/php-jwt seems like a mature, stable JWT lib, especially since WMF already deploys 5.0.0. And jwt.io even likes it. A minor nit might be pinning to a specific version within composer.json on the patch, since that tends to be a bit more secure and would match what would need to be in mediawiki/vendor.

We can do this, but there is a downside to 3rd parties:

You can specify the exact version of a package. This will tell Composer to install this version and this version only. If other dependencies require a different version, the solver will ultimately fail and abort any install or update procedures.

https://getcomposer.org/doc/articles/versions.md#exact-version-constraint
meaning if we require 5.0.0 and another extension requires 5.0.1 then they wont be able to install either extension because both are declaring incompatible versions. This is why it is better to specify a compatible versions as a bound range than to specify an exact version. The exact version (as well as the hash) is maintained by the project's composer.lock file which should be committed to the VCS. I'm fine though if we want to reduce the range to something like 5.0.* (instead of 5.*) which is equivalent to ~5.0.0.

We can bind it if you think it's necessary, but I wanted to highlight the downside of doing that, as it could prevent 3rd party wikis from pushing security fixes (from the library) quickly.

  1. I noticed in the latest PS on the patch there was a TODO around setting an exp field in the payload. I'd strongly advise this be done and then the tokens properly invalidated after said expiry, otherwise I'd place a higher risk rating (medium+) for this option. I don't have enough experience with CU to know for certain, but it seems like there could be a theft/replay risk with folks who have both checkuser and checkuser-log rights and we'd want to minimize this attack surface as much as we can. I think we'd consider these to be very trusted users, but such assumptions aren't really part of a robust security model.

That's fine. @Niharika do you have a preference for how long investigations should last before they have to be redone? 24 hours? less?

  1. It might be a good idea to programmatically check for the existence of the encryption algorithm as we do in core in addition to the $secret check within the TokenManager constructor.

Ah! sure that's a good idea. It's too bad that code in core isn't usable outside of the session handler. :/

  1. I'm not certain I love the way the IV seed is being created in the patch, though it's probably good enough for this application.

Do you have an alternative example?

We can do this, but there is a downside to 3rd parties:
We can bind it if you think it's necessary, but I wanted to highlight the downside of doing that, as it could prevent 3rd party wikis from pushing security fixes (from the library) quickly.

Yes, there will always be trade-offs when it comes to security best practices concerning vendor code. As someone coming at this purely from the security side, I would always recommend pinning versions (or at least using ~ instead of ^ semver ranges), if possible, as there's less opportunity for breaking changes to occur and for potentially dangerous (even malicious) code to accidentally be introduced. Of course, pinning and tilde ranges can also hinder or prevent upgrading to newer versions where security issues have been patched. It's not that big of a deal here for extensions, since the packages will need to be pinned in wmf production anyways.

That's fine. @Niharika do you have a preference for how long investigations should last before they have to be redone? 24 hours? less?

Ideally, this would be the least amount of time possible without making it grossly irritating for CUs to perform these investigations. 24 hours seems a little high, but that's just my arbitrary opinion, and it would certainly be better than nothing. And as long as the code is consistently expiring with exp, I don't think we'd need to worry about using unique jti values, which seems like potential over-engineering for this use-case, especially since the payloads will be encrypted.

  1. I'm not certain I love the way the IV seed is being created in the patch, though it's probably good enough for this application.

Do you have an alternative example?

I don't think I have a simple solution :) My main concern here is that since the implementation is exposed via FLOSS dev, we should probably try to add more entropy to the seed, outside of just trivially-found values like wikiId and getName(). It might be nice to have some kind of service (or even a db table) that could generate random, persistent seed values per-user for situations like this, though that's probably a non-starter at the moment. Otherwise, perhaps concatenating more immutable get* data from the User class or mw global variables.

Further researching implementations similar to option 2, could the token be passed around in a response header as opposed to a url param? That would vastly reduce its exposure in various logs, etc.

Further researching implementations similar to option 2, could the token be passed around in a response header as opposed to a url param? That would vastly reduce its exposure in various logs, etc.

It would need to be passed in the URL because, as far as I know, it's impossible to to set the headers on a simple anchor element. If the sensitive data is encrypted, does it make a difference? Technically we are intentionally exposing this data in the CheckUserLog, so anyone with checkuser permission can see the unencrypted data anyways (and obviously anyone with production database access).

errrrr... to further clarify, we like #2 as a solution over #3 because it allows us a place to put things like filters that are not necessarily in the log itself, but could reveal information. For instance, if a CheckUser filters by a User Agent, on its own, that's not PII, but if you knew what user they were investigating at the time of the request, you could pretty easily put two and two together. These filters aren't currently recorded in the CheckUserLog. However, again, if someone has checkuser permission, they could repeat the investigation and they would have access to all of the data that was encrypted in the token in the first place.

That's fine. @Niharika do you have a preference for how long investigations should last before they have to be redone? 24 hours? less?

@dbarratt Can you elaborate a bit on what the user-facing impact of such an expiry would be? Will the session expire after 24 hours and they will have to re-initiate the investigation? Will their already open tabs be affected?

@dbarratt Can you elaborate a bit on what the user-facing impact of such an expiry would be?

I was thinking it would ignore an expired token and bring you back to the investigation page, but it could have some other behavior like displaying an error and telling the user to start a new investigation.

This time limit is known, so we could display this to the user (like a countdown clock or something)?

Will the session expire after 24 hours and they will have to re-initiate the investigation?

Correct, or whatever time limit we set.

Will their already open tabs be affected?

Yes and No. Their open tabs wont change until they click on a link (pagination/tab) or change a filter.

It would need to be passed in the URL because, as far as I know, it's impossible to to set the headers on a simple anchor element.

Fair enough, I wasn't clear on that piece of the implementation.

If the sensitive data is encrypted, does it make a difference? Technically we are intentionally exposing this data in the CheckUserLog, so anyone with checkuser permission can see the unencrypted data anyways (and obviously anyone with production database access).

For now, no, though passing around tokens as URL params does have the potential to expose them in other places, e.g. webrequest and similar logs. And we also don't want to increase the attack surface here (if possible) just because we do it somewhere else, especially given the portable nature of JWTs.

errrrr... to further clarify, we like #2 as a solution over #3 because it allows us a place to put things like filters that are not necessarily in the log itself, but could reveal information. For instance, if a CheckUser filters by a User Agent, on its own, that's not PII, but if you knew what user they were investigating at the time of the request, you could pretty easily put two and two together. These filters aren't currently recorded in the CheckUserLog. However, again, if someone has checkuser permission, they could repeat the investigation and they would have access to all of the data that was encrypted in the token in the first place.

This is understandable, but also comes with increased risk with larger quantities and types of potentially-sensitive data being stored within JWTs. Again, of the approaches defined above, this is probably the best and cleanest option for a number of reasons. The Security-Team is mainly interested in clarifying and documenting potential risks.

@dbarratt Got it. I think we can go with 24-hours as a starting point and iterate as needed.

Further researching implementations similar to option 2, could the token be passed around in a response header as opposed to a url param? That would vastly reduce its exposure in various logs, etc.

@sbassett If the question is whether we can do a POST request to avoid passing information via the URL, then it is possible. This is option #4 from the list in T239680#5782446, implemented in https://gerrit.wikimedia.org/r/#/c/561890/

Tagging @Krinkle who also expressed concern with passing the information via the URL.

sbassett added a comment.EditedTue, Jan 14, 3:57 PM

@Tchanders - are you proposing to pass the JWT around via POST or only the relevant data (likely the encrypted payload from the JWT option)? I suppose either would provide for a slightly smaller attack surface than sending any token/data as a query param. Though using JWTs seems like it might be a bit cleaner, however I'm not sure if that also entails a reworking of the pager and filters as mentioned for option 4 in T239680#5772095.

@sbassett I may have misunderstood your original question - I thought you were asking in general whether there was a way to solve our problem by sending information via headers rather than URLs. I was just restating the proposal in #4, to get to different pages by re-posting the original form, passing the relevant paging params via hidden fields.

Changes to the pager would entail intercepting the anchors' click events (as per current implementation) or replacing them with submit buttons. Filters would presumably become additional form fields.

We proposed option #4 in case passing information in the URL and exposing it in the logs was deemed too large a risk; however, it sounds like this is not the case.

We proposed option #4 in case passing information in the URL and exposing it in the logs was deemed too large a risk; however, it sounds like this is not the case.

Encapsulated within a JWT with an encrypted payload and expiration, it shouldn't be too large of a risk to pass it as a query param, IMO. But passing a token via POST or as a header would be better still, though not a hard requirement. Not sure if others (@Krinkle) might feel differently.

@sbassett Thanks for clarifying.

I've updated the JWT's payload, which now looks like this:

{
  "iss": "local",
  "sub": "Jim Gordon",
  "exp": 1579233518,
  "data": "u4QjmB7yHcgwfAbdQf+dMImkUTA4jQL9dMmkBQey2K8="
}