Provide attribute to indicate that user is temporary account in exported content
Open, Needs Triage, Public

Description

Context

Currently, when exporting an article that has edits from a temporary account, there is no structured way to determine that the user account is a temporary account rather than a named, permanent one:

<revision>
      <contributor>
        <username>~2024-2796</username>
        <id>1881</id>
      </contributor>

Proposal

We could update the <contributor> element to include an <is_temp> child element, set to 1 if the user is a temporary account and 0 otherwise.
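
As a rough illustration of how a consumer might read the proposed field, here is a minimal Python sketch. It assumes the <is_temp> element from the proposal exists as a child of <contributor> with the values 1 and 0. The sample document and the helper name contributor_is_temp are hypothetical, and the sample omits the XML namespace that real exports declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical export fragment: <is_temp> is the PROPOSED field, not an existing one.
SAMPLE = """\
<revision>
  <contributor>
    <username>~2024-2796</username>
    <id>1881</id>
    <is_temp>1</is_temp>
  </contributor>
</revision>
"""


def contributor_is_temp(revision):
    """Return True/False from <is_temp>, or None when the element is absent."""
    text = revision.findtext("contributor/is_temp")
    if text is None:
        # Dumps produced before the field existed carry no information.
        return None
    return text.strip() == "1"


revision = ET.fromstring(SAMPLE)
print(contributor_is_temp(revision))
```

Returning None for a missing element keeps "not a temporary account" distinct from "the dump does not say", which matters for exports generated before such a field is introduced.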

Consequences

  • Dump sizes become larger: one additional line for every user account referenced in an exported page
  • Consumers get a structured mechanism for determining whether a revision's contributor is a temporary account

Event Timeline

kostajh renamed this task from "Provide attrribute to indicate that user is temporary account in exported content" to "Provide attribute to indicate that user is temporary account in exported content". Thu, May 23, 11:41 AM
kostajh edited projects, added Data Products, Dumps-Generation; removed Data-Engineering.

For fun here's some context I have:

We ran into a similar issue when we were designing the page change event schema in T308017.
There is a larger issue about how to represent user types in MW too: T336176: MediaWiki user types

As noted in that ticket in T336176#8855477, we decided to go with boolean fields for this, even though we would have preferred something more flexible and comprehensive.

You can see how we are representing user types in the mediawiki/state/entity/user schema fragment.


All that is to say, adding a user_is_temp or is_temp boolean field to exported content makes sense to me. I don't know who would be responsible for implementing this though. Data Products and @xcollazo might know?

BTW, this is the first I've learned of the term 'contributor' (I haven't really used dumps myself).

That's too bad, because now there's yet another term for this concept.

https://wikitech.wikimedia.org/wiki/Data_modeling_guidelines#performer_vs._actor_vs._user

actor was added to MediaWiki core in 2018 to represent either a user or IP address in order to save space in large tables like logging and revision.

performer is used in the Event Platform to indicate the user performing the action. This was chosen because it was already in use in MediaWiki core, in LogEntry and RecentChange. It was a surprise to the designers that actor was already in use.

And in the revision's case, events give a revision an editor field, which I think maps to the dumps' contributor here.

I think we'll want to update the Data Modeling Guidelines to document and disambiguate contributor. I'll do that soon when I work on resolving the outstanding comments about the guidelines.

> BTW, this is the first I've learned of the term 'contributor' (I haven't really used dumps myself).

FWIW, that term goes back to at least 2015 (and probably 2003 according to the old Export.php.)

> For fun here's some context I have:
>
> We ran into a similar issue when we were designing the page change event schema in T308017.
> There is a larger issue about how to represent user types in MW too: T336176: MediaWiki user types
>
> As noted in that ticket in T336176#8855477, we decided to go with boolean fields for this, even though we would have preferred something more flexible and comprehensive.
>
> You can see how we are representing user types in the mediawiki/state/entity/user schema fragment.
>
> All that is to say, adding a user_is_temp or is_temp boolean field to exported content makes sense to me. I don't know who would be responsible for implementing this though. Data Products and @xcollazo might know?

We could definitely incorporate this into Dumps 2.0 once it is available in the user schema fragment (and thus part of https://schema.wikimedia.org/repositories/primary/jsonschema/mediawiki/page/change/current.yaml).

> once it is available in the user schema fragment

Oh, it's already there, @xcollazo!

Ah, my bad. Ok then it should be easy to incorporate this into Dumps 2.0.

Considering that adding N booleans takes the same work as adding one, I wonder if the others would be interesting to a dumps consumer, and not just is_temp:

is_bot:
  description: >
    True if this user is considered to be a bot at the time of this event.
    This is checked via the $user->isBot() method, which considers both
    user_groups and user permissions.
  type: boolean

is_system:
  description: >
    True if the user is a MediaWiki 'system' user. These are users that
    cannot 'authenticate'.  These are usually listed in ReservedUsernames.
  type: boolean

is_temp:
  description: >
    True if the user is an autocreated temporary MediaWiki user.
    This is used for IP masking.
  type: boolean
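
To make the "N booleans" idea concrete, here is a hedged Python sketch of how the three flags above could be written into a <contributor> element using the 1/0 convention from the task description. This only illustrates the XML proposal, not how Dumps 2.0 stores data; the function name build_contributor and the element layout are assumptions.

```python
import xml.etree.ElementTree as ET

# Flag names taken from the schema fragment quoted above.
FLAGS = ("is_bot", "is_system", "is_temp")


def build_contributor(username, user_id, flags):
    """Build a <contributor> element with one 0/1 child per flag (hypothetical layout)."""
    contributor = ET.Element("contributor")
    ET.SubElement(contributor, "username").text = username
    ET.SubElement(contributor, "id").text = str(user_id)
    for name in FLAGS:
        # Flags missing from the mapping are emitted as 0.
        ET.SubElement(contributor, name).text = "1" if flags.get(name, False) else "0"
    return contributor


element = build_contributor("~2024-2796", 1881, {"is_temp": True})
print(ET.tostring(element, encoding="unicode"))
```

Driving the output from a single tuple of flag names means a further boolean would only need one more entry in FLAGS.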

Additionally, we don't have one for is_anonymous, which is unfortunately encoded in the dumps via the user_id column in a very confusing way:

user_id                     BIGINT    COMMENT 'id of the user that made the revision; null if anonymous, zero if old system user, and -1 when deleted or malformed XML was imported',
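
The overloaded user_id encoding in that column comment can be decoded roughly as follows. This is a sketch based only on the comment text above; the function name classify_user_id, the label strings, and the handling of values the comment does not mention are assumptions.

```python
from typing import Optional


def classify_user_id(user_id: Optional[int]) -> str:
    """Map the overloaded user_id column to a label, following the column comment."""
    if user_id is None:
        return "anonymous"
    if user_id == 0:
        return "old_system_user"
    if user_id == -1:
        return "deleted_or_malformed"
    if user_id > 0:
        # A positive id alone cannot tell a named account from a temporary one.
        return "has_user_id"
    # Other negative values are not described by the comment (assumption).
    return "unknown"


for value in (None, 0, -1, 1881):
    print(value, classify_user_id(value))
```

Note that a positive id such as 1881 from the task description still does not reveal whether the account is temporary, which is the gap a dedicated boolean would close.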

> I wonder if the others would be interesting to a dumps consumer

Sure, why not!

> is_anonymous

I can't recall all the discussions, but I think anonymous is a bit of an ambiguous term, especially with is_temp now. Comments in T336176: MediaWiki user types might have more details. There was a lot of discussion around this term with the IP masking / temp user project. It would be worth asking those folks if they have definitions documented somewhere.

Probably safest not to add this one for now.

@lbowmaker this should be included in the DPE Temp accounts work. Do we have a project tag for that work yet?