Page MenuHomePhabricator

[SPIKE] Determine what – if any – changes need to be made to editing-related schemas to handle temporary accounts
Closed, ResolvedPublic

Description

T324492 will introduce a new account type: temporary accounts.

This task involves the work of identifying what – if any – changes will need to be made to editing-related schemas so that the Editing Team can differentiate between people editing with different account types [i] for the purposes of aggregating the editing behavior of these different account types and making decisions in response to the conclusions we draw from these aggregates.

Decisions to be made

  • What – if any – changes need to be made to editing-related schemas to handle temporary accounts?

Open questions

  • 1. Might it be possible for the updates required to be made in a general users table in the data lake, such that no schema-specific changes would be required? @DLynch posed this question offline.

i. : Logged out, temporary account, logged in. Note: it strikes me that the notion of a "logged out editor" will no longer exist so maybe we can remove this altogether.

Event Timeline

Might it be possible for the updates required to be made in a general users table in the data lake, such that no schema-specific changes would be required? @DLynch posed this question offline.

As far as I can tell, looking at what's available to me in Superset, no such general users table exists. But if I'm wrong, and it exists and can be joined with our event data, then wouldn't need to do anything (we have a user_id field that could be used for joins, for temp users as well). @MNeisler Can you confirm that this isn't feasible?


I had a look at this. I was hoping to find that someone has already figured it out for us and that we could just copy their solution, but it seems no one has. But there is some related work that we can take inspiration from:

  • T308017 added some schemas for relaying MediaWiki edits as events, including a reusable fragment defining a user entity with a is_temp field. However, I'm not sure if it would make sense to complicate our legacy schemas by reusing it, and it would force us to generate and store data for other fields in that entity that we don't need.
  • T333223 proposes adding a user_is_temp field to the core MediaWiki user table. It also advises against recognizing temp users by their name pattern.

With this in mind, this is the solution I like the most:

  • We could add user_is_temp to our schemas. It's a small simple change, which is good for legacy code. It'll be easy to use in queries. And it can be copied to other legacy schemas (including by other teams) with minimal effort.

Other ideas that I don't like:

  • We could extend the user_class field we already have. Currently it can be "IP", "bot" or empty (and we never actually record "bot"). This is probably a bad idea for reasons noted in T308017#8310120 ("I don't think it's future-proof to add user_type since if we decide we want such a concept in MW core, the details may differ from what you decide on here") [I found a link to that comment in T333223]. Also, no other schema has a field like that.
  • We could add user_name and recognize the temp user names in queries. This is probably a bad idea for reasons noted in T333223. Also, we would have to worry about potentially recording abusive user names, and unpleasant surprises when they pop up while exploring the data.

@DLynch @MNeisler Thoughts?

Adding user_is_temp and filling it with whatever the server/client-side $user->isTemporary() says at that time seems reasonable to me. As a minor advantage over joining with a hypothetically-existing users table, it'd future-proof us against needing to make changes if a conversion-to-permanent-account workflow is created later.

Thanks @matmarex and @DLynch. I took a look through some of the proposed changes related to this work. Some comments below:

As far as I can tell, looking at what's available to me in Superset, no such general users table exists. But if I'm wrong, and it exists and can be joined with our event data, then wouldn't need to do anything (we have a user_id field that could be used for joins, for temp users as well). @MNeisler Can you confirm that this isn't feasible?

There is a core Mediawiki user table, which can be queried from in Superset (it's called wmf_raw.mediawiki_user in the presto_analytics_hive database ). Per T332205#8703448, all temp accounts will have a user id assigned so I could use that to join to EditAttemptStep to determine if the user is temporary.

This is feasible but would complicate and slow down queries, especially from within Superset. In addition, if Data Engineering decides not to add an explicit user_is_temp flag to identify temp accounts, we would need to run a regex on the username column, which would result in the issues already mentioned in the above comments and in T333223 (such as, the prefix for temp users might be subjective to change).

We could add user_is_temp to our schemas. It's a small simple change, which is good for legacy code. It'll be easy to use in queries. And it can be copied to other legacy schemas (including by other teams) with minimal effort.

This seems like a reasonable approach to me as well and more reliable than the other options proposed. It would make querying much more simple compared to trying to use a join to another table and avoid a lot of the potential problems caused by trying to use a regex on a new user name field.

Adding user_is_temp and filling it with whatever the server/client-side $user->isTemporary()

If we use this method, would the new user_is_temp column in the editing schemas align with the flag being proposed by Data Engineering in T333223 or is it possible I may get a different result if I join to the proposed user_is_temp field in the user table.

Just want to make sure temp users are being consistently defined across the various tables.

If we use this method, would the new user_is_temp column in the editing schemas align with the flag being proposed by Data Engineering in T333223 or is it possible I may get a different result if I join to the proposed user_is_temp field in the user table.

There's not a specific implementation to compare it to yet, but I don't think there'd be any difference. The only way it might ever drift apart is if a way to convert a user from temporary to permanent ever gets added -- in that case it'd be like the edit_count field, where the value stored in the editing schema is a snapshot from the time of the edit, whereas the value in the user table would be the value at the time you're running the query. (But there's not currently a plan for that migration to be possible, as I understand it.)

matmarex moved this task from Doing to Ready for Sign Off on the Editing-team (Kanban Board) board.

Thanks, I think that's settled then, I'll write the patches to add the user_is_temp fields.

There is a core Mediawiki user table, which can be queried from in Superset (it's called wmf_raw.mediawiki_user in the presto_analytics_hive database )

That's neat, I didn't know that they were available in Superset, this will probably come in handy in the future.