Page MenuHomePhabricator

Delete edit recovery data of logged in user when anonymous or other user interacts with website
Open, Needs TriagePublic


Feature summary:

When a user has become logged out (either, active but even through session expiration) and the database still contains data from a logged in user, delete this data.

Not sure if there is enough data in the design to determine if it came from a logged in user, but adding an anonymous vs. user flag, is the easiest to do of course. In order to deal with the 'other user' issue, we could do something like: hash userid, take the first 4 chars of the hash and use it as a vary value on the data. If these 4 chars differ, then likely it's another user. The goal is to not leak userid, but to make it likely to detect a different user than the previous user.

Another ticket theorised the if it was useful to have a small module on all pages for issues like this. I think it is. My suggestion would be to add a dedicated file for this with the most minimal checks and make it part of Then you avoid adding a module, but you still have a very high likelyhood of deleting the data.

Acceptance criteria:

  • When a logged out user visits a page, they should not have a high change of being able to recover an edit from a logged in user.
  • When another user visits a page, they should not have a high change of being able to recover an edit from a previous user.

Use cases:

  • As a user, I don't want to use a library's computer, leave and have another user find my edits.

QA notes:

Event Timeline

@Samwilson I'm adding this ticket, as I think we can do even better than T341956, T341845 and T341853 . I think it's important to do this right, due to how long this data potential can persist browser side.

Please evaluate and adapt or reject ticket as Team sees fit.

TheDJ updated the task description. (Show Details)

@TheDJ thanks for raising this!

When a user has become logged out (either, active but even through session expiration) and the database still contains data from a logged in user, delete this data.

It does sound like adding a hash of user ID to the stored data is a great way to do this. For anon users we could leave it out. Then we'd only restore if the hash matches or if the data belongs to anon.

A few things that I think are correct (but might not be):

  • Recover edits that were started while not logged in.
  • Not delete edits that were started while logged in and which still exist while logged out.
  • Not recover edits while logged out unless they belong to anon.
  • Delete other user's recovery data (but not anon's) when logging in.

Another thing to note about the privacy of recovery data is that it's not at all private if people share a computer. I agree that this doesn't mean that we have to recover it and so more easily make it visible to the other users, but we also don't want to give the impression that there's any sort of security (not that we will do, it sounds like, because we're not saying anything one way or the other).

So yeah, it sounds like we need to add user (hash of user ID) to the set of fields used as the DB key (currently pageName and section). We wouldn't expect to ever have it exist for more than two values: one user ID at a time, and null (for anon).

BTW. the reason I said "take first four chars of the hash" is because a hash is still a consistent derived value that can be brute forced by iterating over all usernames to identify someone. While a hash would avoid all collisions, a part of a hash will only have 'high chance' of detecting a collision, without being fully identifiable. There might be industry standards for these kinds of partial identification algos. Perhaps the analytics team can help, they have experience with these kinds of vectors.

Oh yep, totally agree — I meant to add that above. Will truncate it to four characters.

Actually, can you advise about the best way to compute hashes in the frontend? Or should it be done PHP-side and passed as a config var? (I can't find any examples of a similar thing anywhere…)

Change 945751 had a related patch set uploaded (by Samwilson; author: Samwilson):

[mediawiki/core@master] [WIP] editRecovery: Add user details to stored data

I do think that one part of this might not be a good idea: to delete the data when an anon user interacts with the site. The trouble with that is that if a logged in user is editing and they become logged-out, then when they load a page they'll lose all their data. Instead, I think it should keep the data, but not load it till they are logged in. Does that sound right?

Anyway, I'm focusing on adding the user identifier to the stored data for a first patch, and then will sort out each scenario of when to store/restore/delete etc.

@TheDJ: Regarding the need to hash the user ID to protect it from anon users: is this actually needed? The username is stored (in the clear) in a cookie for 30 days after logging out. I know that the lifetime of indexedDB can be longer than that, but we're going to limit it to 30 days in T341956, so perhaps it's okay to keep the username or ID in the recovery data?

so perhaps it's okay to keep the username or ID in the recovery data?

I don't think that's for me to decide.

can you advise about the best way to compute hashes in the frontend?

There aren't really built in methods I think

Maybe something like this ?

Or for user ids, you could add all the digits to eachother ?

The problem of leaving behind "private" data on a shared device is certainly widespread, via cookies, localStorage, browser cache etc. and this isn't the only case in MediaWiki.

I'm not sure it is our responsibility to try and fix it, especially as anything we do to expunge data will be insecure. A malicious user could just inspect the contents of the DB before visiting our site and letting our deletion code run.

Also any fix will likely lead to inadvertent losses of data for users who regularly switch accounts for legitimate reasons.

The more I think about this the more I wonder how to do it in an intuitive way. If we split the stored data by user there are lots of ways that we get confused:

  • If you edit while logged out, and then log in, we can use the logged-out data. That makes sense.
  • But if you edit while logged in, then your session expires and you edit some more, then you log in again, you'll have two stored records. We could look at the timestamp of each, and use the most recent and throw away the other, but that feels a bit strange. Or we could ignore the anon record, but that's then different to the behaviour in the first point.
  • If we make a distinction between logged in and out, then it seems that we'd never want the anon user to be able to delete the logged-in user's data — but that's wrong, because we do want to be able to expire old records or have the user log in as a different user and still have everything work.

Keep in mind that we're going to delete all data when (explicitly) logging out (in T341845). If we do allow the logged-in data to be restored by an anon user, it's only ever going to be done in cases when no one has actually clicked log out.

Basically, this all just feels like we're trying to pretend that the private data is safe when actually it's not at all.

(Sorry if this is a bit confusing... I'm feeling a bit confused by it!)

Samwilson added a subscriber: JWheeler-WMF.

The details of what we should do here are not fully clear. For now, the simpler thing is to postpone it until we have more certainty and continue sorting out the other outstanding bugs.

@JWheeler-WMF you might be interested in this task.

@Samwilson is the question how should Edit Recovery work when a user is programmatically logged out (ie, a session timeout?)

It sounds like we've made a stance that when a user voluntarily logs out, their edits should not persist. Therefore, if the user involuntarily logs out, it should likely follow the same precedent. Ideally, we'd inform the user their session is about to end, so their edits wouldn't be lost.

@JWheeler-WMF Yes, because if a session times out (or the cookie is deleted, or the session deleted on the server, etc.; there are various ways it might happen), then we don't want the Edit Recovery data to be deleted — it's precisely this situation that the feature aims to help with. But then we also want to minimise the risk of data leakage — and the danger there is all about multiple people sharing one browser (and the same browser profile).

There are some things that we might be able to do, such as storing the data as associated with a given username, so it wouldn't be recovered when an anon user edits the same page that a logged in user was editing, but would still be recovered when the user logs in again. That works, but it has two issues. Firstly, it'd fragment the ER data: a user could be editing, have their session broken, and keep editing as anon, only to find that we'd restore different data depending on whether they were logged in or not and it wouldn't be clear in either of those states that they had data waiting for them in the other. And secondly, the data really isn't private at all anyway even if we try to hide it — they just need to hit F12 to read the contents of the indexedDB database.

As far as informing the user, I'm not sure it's worth the complexity because we still want it to work when we can't test for a valid session (e.g. the browser is closed and then the session ends, so when we come back we are anon and want to recover). I think it'd still be good to warn users while they're editing that their session has expired; I thought there was an open task for that but I can't find it right now.

@Samwilson do we have any data to understand the rate at which a session times out, and the rate at which a session times out while there is an unsaved edit?

I think the sensible thing is to

  1. store the data associated with the given username.
  2. warn users when they're editing that their session has expired. I'm still not quite sure how to create this ticket. Can you?