MediaWiki Dumps XML - Provide attribute to indicate that user is temporary account in exported content
Open, Needs Triage, Public

Description

Context

Currently, when exporting an article that has edits from a temporary account, there is no structured way to determine that the user account is a temporary account rather than a named, permanent one:

<revision>
      <contributor>
        <username>~2024-2796</username>
        <id>1881</id>
      </contributor>

Proposal

We could update the <contributor> element to include an <is_temp> child element, set to 1 if the user is a temp account and 0 otherwise.
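
To illustrate how a dump consumer might read the proposed flag, here is a minimal Python sketch. The <is_temp> element is only the proposal from this task and is not part of any released export schema. The sample fragment is made up from the example above, and the helper name contributor_is_temp is hypothetical. Real exports also declare an XML namespace on the root element, which this sketch leaves out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical export fragment: <is_temp> is the element proposed in this
# task, not something current exports contain.
SAMPLE = """
<revision>
  <contributor>
    <username>~2024-2796</username>
    <id>1881</id>
    <is_temp>1</is_temp>
  </contributor>
</revision>
""".strip()


def contributor_is_temp(revision):
    """Return True/False from <is_temp>, or None when the element is absent."""
    flag = revision.findtext("contributor/is_temp")
    if flag is None:
        # Export written without the proposed element: nothing to report.
        return None
    return flag.strip() == "1"


revision = ET.fromstring(SAMPLE)
print(contributor_is_temp(revision))
```

Returning None for a missing element keeps "not a temp account" distinct from "this export does not carry the information".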

Consequences

  • Dump sizes become larger, one additional line for every user account referenced in an exported page
  • Structured mechanism for determining if a contributor of a revision is a temporary account

Event Timeline

kostajh renamed this task from Provide attrribute to indicate that user is temporary account in exported content to Provide attribute to indicate that user is temporary account in exported content. May 23 2024, 11:41 AM

For fun here's some context I have:

We ran into a similar issue when we were designing the page change event schema in T308017.
There is a larger issue about how to represent user types in MW too: T336176: Define MediaWiki user types

As noted in that ticket in T336176#8855477, we decided to go with boolean fields for this, even though we would have preferred something more flexible and comprehensive.

You can see how we are representing user types in the mediawiki/state/entity/user schema fragment.


All that is to say, adding a user_is_temp or is_temp boolean field to exported content makes sense to me. I don't know who would be responsible for implementing this though. Data Products and @xcollazo might know?

BTW, this is the first I've learned of the term 'contributor' (I haven't really used dumps myself).

That's too bad, because now there's yet another term for this concept.

https://wikitech.wikimedia.org/wiki/Data_modeling_guidelines#performer_vs._actor_vs._user

actor was added to MediaWiki core in 2018 to represent either a user or IP address in order to save space in large tables like logging and revision.

performer is used in the Event Platform to indicate the user performing the action. This was chosen because it was already in use in MediaWiki core, in LogEntry and RecentChange. It was a surprise to the designers that actor was already in use.

And in events, a revision has an editor field, which I think maps to the dumps' contributor here.

I think we'll want to update the Data Modeling Guidelines to document and disambiguate contributor. I'll do that soon when I work on resolving the outstanding comments about the guidelines.

BTW, this is the first I've learned of the term 'contributor' (I haven't really used dumps myself).

FWIW, that term goes back to at least 2015 (and probably 2003 according to the old Export.php.)

We could definitely incorporate this into Dumps 2.0 once it is available in the user schema fragment (and thus part of https://schema.wikimedia.org/repositories/primary/jsonschema/mediawiki/page/change/current.yaml).

once it is available in the user schema fragment

Oh, it's already there @xcollazo !

Ah, my bad. Ok then it should be easy to incorporate this into Dumps 2.0.

Considering that adding N booleans takes the same work as adding one, I wonder if the others, and not just is_temp, would be interesting to a dumps consumer:

is_bot:
  description: >
    True if this user is considered to be a bot at the time of this event.
    This is checked via the $user->isBot() method, which considers both
    user_groups and user permissions.
  type: boolean

is_system:
  description: >
    True if the user is a MediaWiki 'system' user. These are users that
    cannot 'authenticate'.  These are usually listed in ReservedUsernames.
  type: boolean

is_temp:
  description: >
    True if the user is an autocreated temporary MediaWiki user.
    This is used for IP masking.
  type: boolean

Additionally, we don't have one for is_anonymous, which is unfortunately encoded in the dumps via user_id in a very confusing way:

user_id                     BIGINT    COMMENT 'id of the user that made the revision; null if anonymous, zero if old system user, and -1 when deleted or malformed XML was imported',
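
To make the overloaded encoding concrete, here is a small Python sketch that decodes user_id according to the column comment above. The function name and the returned labels are hypothetical. Treating any positive value as an account with a user id is an assumption, since the comment only describes the special values.

```python
from typing import Optional


def describe_user_id(user_id: Optional[int]) -> str:
    """Decode the special user_id values described in the column comment."""
    if user_id is None:
        return "anonymous"
    if user_id == 0:
        return "old system user"
    if user_id == -1:
        return "deleted, or imported from malformed XML"
    if user_id > 0:
        # Assumption: positive ids belong to accounts. The id alone cannot
        # tell a named account from a temporary one.
        return "account with a user id"
    return "unknown"


for value in (None, 0, -1, 1881):
    print(value, "->", describe_user_id(value))
```

Note that nothing in this encoding separates temporary accounts from named ones, which is the gap this task is about.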

I wonder if the others would be interesting to a dumps consumer

Sure, why not!

is_anonymous

I can't recall all the discussions, but I think anonymous is a bit of an ambiguous term, especially with is_temp now. Comments in T336176: Define MediaWiki user types might have more details. There was a lot of discussion around this term with the IP masking / temp user project. It would be worth asking those folks if they have definitions documented somewhere.

Probably safest not to add this one for now.

@lbowmaker this should be included in the DPE Temp accounts work. Do we have a project tag for that work yet?

This task is about the XML format of export. This impacts not just dumps (1 or 2); it also impacts the Special:Export feature and the ability of MediaWiki to import XML exports. The comments get into naming conventions and I'll let others decide on those, but the short of it is that there are multiple names for multiple concepts overlapping in both directions, all related to "user". So it's messy. For importing to work, the XML content must include something that we can map to the user_is_temp property of the user table. I am not sure to what extent @Niharika's team planned on changing the import/export PHP code to account for this. So, possible plan:

  1. check if work / thinking has been done to account for how we represent user_is_temp in the XML format and, later, how we import it back
  2. if not done, do that work. The relevant part of the schema is ContributorType, see here
  3. Release a new XSD schema, version 0.12, including the decisions above
  4. Update the PHP code that handles export and import
  5. Update XML-rendering code from Dumps 2 to conform to the new schema

These are all relatively straightforward; I expect the naming / schema work to be the hardest part. For that, I think this makes the most sense given the current schema:

<complexType name="ContributorType">
  <sequence>
    <element name="username" type="string" minOccurs="0"/>
    <element name="id" type="nonNegativeInteger" minOccurs="0"/>
    <element name="ip" type="string" minOccurs="0"/>
    <!-- new for temp accounts: a new element type is the only choice that keeps the abstraction consistent -->
    <element name="temp_id" type="nonNegativeInteger" minOccurs="0"/>
  </sequence>
  <!--  This allows deleted=deleted on non-empty elements, but XSD is not omnipotent  -->
  <attribute name="deleted" type="mw:DeletedFlagType"/>
</complexType>
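
As a rough illustration of how an importer or dump consumer could use this shape, here is a Python sketch that classifies a <contributor> element by which child elements are present. The temp_id element is only the proposal above, the function name and labels are hypothetical, and whether temp_id would appear alongside username is an assumption. Namespace handling is left out.

```python
import xml.etree.ElementTree as ET


def contributor_kind(contributor):
    """Classify a <contributor> element under the proposed ContributorType."""
    if contributor.get("deleted") == "deleted":
        return "deleted"
    if contributor.find("temp_id") is not None:
        # Proposed element: present only for temporary accounts.
        return "temp"
    if contributor.find("ip") is not None:
        return "ip"
    if contributor.find("id") is not None:
        return "named"
    return "unknown"


samples = [
    "<contributor><username>~2024-2796</username><temp_id>1881</temp_id></contributor>",
    "<contributor><ip>192.0.2.1</ip></contributor>",
    "<contributor><username>Example</username><id>42</id></contributor>",
    '<contributor deleted="deleted"/>',
]
for xml_text in samples:
    print(contributor_kind(ET.fromstring(xml_text)))
```

The checks run from most specific to least specific, so a contributor that carries temp_id is never mistaken for a named account.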

Ottomata renamed this task from Provide attribute to indicate that user is temporary account in exported content to MediaWiki Dumps XML - Provide attribute to indicate that user is temporary account in exported content. Sep 24 2024, 2:50 PM

Can we please describe the use case for this?

Can we please describe the use case for this?

Researchers who use the XML dumps. Currently it's very easy to work out who is a not-logged-in contributor (the name is an IP address) vs. who has a permanent account. When temporary accounts are deployed, the only indicator that a username is temporary is the prefix (~2024-). Maybe that is enough for now, and we don't need to do anything else here.

If nothing is done here, then the export will have a username like <username>~2024-2796</username>. If someone wants to parse dump files to recognize an anonymous username, they'll have to look up the pattern configuration for the wiki. That seems niche enough not to block the minor pilot wiki deployment, so I am removing the minor pilot wiki blocker tag.

Couple follow ups:
IIRC, this prefix is a config and can be changed on a per wiki basis?

Also, will it track the year of the temp account creation like ~2024- suggests? So next year accounts would be ~2025-?

Can you please share a link to where I can go and learn about this config?

Couple follow ups:
IIRC, this prefix is a config and can be changed on a per wiki basis?

We will not adjust this per wiki.

Also, will it track the year of the temp account creation like ~2024- suggests? So next year accounts would be ~2025-?

Yes, that's right. The relevant config is $wgAutoCreateTempUser['serialProvider']['useYear'] = true;, see https://www.mediawiki.org/wiki/Manual:$wgAutoCreateTempUser

Can you please share a link to where I can go and learn about this config?

The epic is here T345760: [Epic] Temporary username format. @Niharika @sgrabarczuk is there a page where we've summarized the various factors that went into deciding the format, how it works, etc?

@kostajh so would we still consider this to be a blocker in general (just not for pilot wikis)?

If this is a blocker, then I'd like to point out that currently the work needed would be:

  1. decide on what the xml output should look like (new <temp> element?)
  2. create a new export xsd schema version that implements this decision
  3. update MW export functionality to use the new schema
  4. update import code
  5. update dumps 2 pipelines to pass, use, and export this information

I don't think this needs to block anything. If someone wants to analyze dumps to parse temp accounts separate from named users, they can do so by using the prefix (~2024- or just ~2).
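
For anyone who does want to separate temp accounts from named users today, a prefix check along these lines is a minimal sketch of the approach. The regular expression assumes the ~YYYY- format discussed above, and the function name and labels are hypothetical. The authoritative pattern lives in $wgAutoCreateTempUser, so this would need adjusting if that configuration differs.

```python
import ipaddress
import re

# Assumption: temp account names start with "~", a four-digit year, and "-",
# as in ~2024-2796. The real pattern comes from $wgAutoCreateTempUser.
TEMP_NAME = re.compile(r"^~\d{4}-")


def classify_username(name):
    """Classify a dump username as 'ip', 'temp', or 'named'."""
    try:
        ipaddress.ip_address(name)
        return "ip"
    except ValueError:
        pass  # not an IP address, keep checking
    if TEMP_NAME.match(name):
        return "temp"
    return "named"


for name in ("192.0.2.1", "~2024-2796", "Example"):
    print(name, "->", classify_username(name))
```

Loosening the pattern to just a leading "~2" would match the shorter prefix mentioned above, at the cost of being less strict about the year.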

If a consumer of dump data wants to have more structured output in the XML, then I hope we will one day hear from that consumer on this task. But until that happens, we could probably just leave this task open as a feature request, but not act on it.