Investigate possible issues with imported user names for FileImporter
Closed, Resolved · Public · 3 Story Points

Description

As discussed on the wikitech-l mailing list, there are issues with imported user names. Investigate how this affects the FileImporter (if at all).

Right now we simply pass in the username and expect MediaWiki to assign the correct user ID to the edit.


From: "Brad Jorsch (Anomie)"
Date: 31 Oct 2017 2:52 pm
Subject: [Wikitech-l] Proposal regarding the handling of imported usernames
To: "Wikimedia developers" <wikitech-l@lists.wikimedia.org>
Cc:

Handling of usernames in imported edits in MediaWiki has long been weird
(T9240[1] was filed in 2006!).

If the local user doesn't exist, we get a strange row in the revision table
where rev_user_text refers to a valid name while rev_user is 0 which
typically indicates an IP edit. Someone can later create the name, but
rev_user remains 0, so depending on which field a tool looks at the
revision may or may not be considered to actually belong to the
newly-created user.

If the local user does exist when the import is done, the edit is
attributed to that user regardless of whether it's actually the same user.
See T179246[2] for an example where imported edits got attributed to the
wrong account in pre-SUL times.

In Gerrit change 386625[3] I propose to change that.

   - If revisions are imported using the "Upload XML data" method, it will
   be required to fill in a new field to indicate the source of the edits,
   which is intended to be interpreted as an interwiki prefix.
   - If revisions are imported using the "Import from another wiki" method,
   the specified source wiki will be used as the source.
   - During the import, any usernames that don't exist locally (and can't
   be auto-created via CentralAuth[4]) will be imported as an
   otherwise-invalid name, e.g. an edit by User:Example from source 'en' would
   be imported as "en>Example".[5]
   - There will be a checkbox on Special:Import to specify whether the same
   should be done for usernames that do exist locally (or can be created) or
   whether those edits should be attributed to the existing/autocreated local
   user.
   - On history pages, log pages, and the like, these usernames will be
   displayed as interwiki links, much as might be generated by wikitext like "
   [[:en:User:Example|en>Example]]". No parenthesized 'tool' links (talk,
   block, and so on) will be generated for these rows.
   - On WMF wikis, we'll run a maintenance script to clean up the existing
   rows with valid usernames and rev_user = 0. The current plan there is to
   attribute these edits to existing SUL users where possible and to prefix
   them with a generic prefix otherwise, but we could as easily prefix them
   all.
      - Unfortunately it's impossible to retroactively determine the actual
      source of old imports automatically or to automatically do anything about
      imports that were misattributed to a different local user in pre-SUL
      times (e.g. T179246[2]).
      - The same will be done for CentralAuth's global suppression blocks.
   In this case, on WMF wikis we can safely point them all at Meta.

If you have comments on this proposal, please reply here or on
https://gerrit.wikimedia.org/r/#/c/386625/.


Background: The upcoming actor table changes[6] require some change to the
handling of these imported names because we can't have separate attribution
to "Example as a non-registered user" and "Example as a registered user"
with the new schema. The options we've identified are:

   1. This proposal, or something much like it.
   2. All the existing rows with rev_user = 0 would have to be attributed
   to the existing local user (if any), and in the future when a new user is
   created any existing edits attributed to that name will be automatically
   attributed to that new account.
   3. All the existing rows with rev_user = 0 and an existing local user
   would have to be re-attributed to different *valid* usernames, probably
   randomly-generated in some manner, and in the future when a new user is
   created any existing edits for that name would have to be similarly
   re-attributed.
   4. Like #2, except the creation (including SUL auto-creation) of the
   same-named account would not be allowed. Thus, an import before the local
   name exists would forever block that name from being used for an actual
   local account.
   5. Some less consistent combination of the "all the existing rows" and
   "when a new user is created" options from #2–4.

Of these options, this proposal seems like the best one.

[1]: https://phabricator.wikimedia.org/T9240
[2]: https://phabricator.wikimedia.org/T179246
[3]: https://gerrit.wikimedia.org/r/#/c/386625/
[4]: https://phabricator.wikimedia.org/T111605
[5]: ">" was chosen rather than the more typical ":" because the former is
already invalid in all usernames (and page titles). While a colon is *now*
disallowed in new usernames, existing names created before that restriction
was added can continue to be used (and there are over 12000 such usernames
in WMF's SUL) and we decided it'd be better not to suddenly break them.
[6]: https://phabricator.wikimedia.org/T167246
Lea_WMDE created this task. Nov 14 2017, 2:06 PM
Restricted Application added a project: TCB-Team. Nov 14 2017, 2:06 PM
Restricted Application added a subscriber: Aklapper.

We haven't checked this in detail but we should do it.

Addshore updated the task description. Feb 13 2018, 3:25 PM
Lea_WMDE triaged this task as Normal priority. Feb 27 2018, 3:34 PM
Lea_WMDE set the point value for this task to 3.
Lea_WMDE moved this task from Todo to Sprint ready on the WMDE-QWERTY-Team board. Mar 6 2018, 11:36 AM
Addshore renamed this task from Investigate issue with imported user names for file importer to Investigate possible issues with imported user names for FileImporter. Mar 12 2018, 11:52 AM
Lea_WMDE closed this task as Resolved. Mar 20 2018, 4:22 PM
Lea_WMDE claimed this task.
Lea_WMDE reopened this task as Open. Mar 20 2018, 4:25 PM
Lea_WMDE removed Lea_WMDE as the assignee of this task.

As noted in the ticket, we should probably have a look at the merged patch that addresses the interwiki username issues:
https://gerrit.wikimedia.org/r/#/c/386625/

WMDE-Fisch moved this task from Sprint Backlog to Doing on the WMDE-QWERTY-Sprint-2018-03-20 board.

Looking at the patches, the code, and the discussion in the tickets, there is a "new" way to handle usernames on import.

The ExternalUserNames class takes care of everything related to processing "external" usernames on imports. Depending on how that class is configured, this means:

  • Validating the username string in general.
  • Checking whether a user by that name exists on the target wiki.
  • If configured to do so, using that name as the local user, assuming it is a CentralAuth-managed account.
  • If configured to do so, creating that user locally via CentralAuth.
  • If configured to do so, applying a preset prefix to the username, which could for example hint at the source wiki.
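
The steps above can be sketched roughly as follows. This is only an illustration in Python, not MediaWiki's PHP and not the ExternalUserNames API: `ImportNameSettings`, `resolve_import_name`, and the two name sets (stand-ins for the local user table and for CentralAuth auto-creation) are all hypothetical, and the validity check is deliberately simplified.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ImportNameSettings:
    """Hypothetical settings object; not a MediaWiki class."""
    prefix: str = "imported"           # hint at the source wiki, e.g. "de"
    assign_known_users: bool = False   # reuse existing/creatable accounts if True
    local_users: Set[str] = field(default_factory=set)      # stand-in for the user table
    creatable_users: Set[str] = field(default_factory=set)  # stand-in for CentralAuth


def is_valid_username(name: str) -> bool:
    # Simplified assumption: non-empty and free of ">", the separator
    # reserved for external names. Real validation is much stricter.
    return bool(name.strip()) and ">" not in name


def resolve_import_name(name: str, settings: ImportNameSettings) -> Optional[str]:
    """Return the name an imported edit would be attributed to, or None if invalid."""
    if not is_valid_username(name):
        return None
    if settings.assign_known_users:
        if name in settings.local_users:
            return name                      # attribute to the existing local user
        if name in settings.creatable_users:
            settings.local_users.add(name)   # simulate auto-creation
            return name
    # Otherwise fall back to an otherwise-invalid, prefixed name.
    return f"{settings.prefix}>{name}"
```

With `assign_known_users` left at `False`, every valid name is prefixed, which corresponds to the "prefix everything" choice for the FileImporter; setting it to `True` corresponds to relying on SUL accounts.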

The Linker class can make use of the prefix, if one is set, when parsing usernames, so the link then points to the user page on the (known) source wiki, e.g. [[de:User:Christoph Jauera (WMDE)]].
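
To make the linking idea concrete, here is a small Python sketch that turns a stored name such as "en>Example" into the kind of wikitext quoted in the proposal above. The function name `external_user_link` is made up for illustration, and the real Linker produces HTML through its own code path rather than wikitext.

```python
from typing import Optional


def external_user_link(stored_name: str) -> Optional[str]:
    """Build an interwiki-style wikitext link for a 'prefix>Name' user name.

    Returns None for names without a usable prefix (i.e. ordinary local names).
    """
    prefix, sep, name = stored_name.partition(">")
    if not sep or not prefix or not name:
        return None
    # The link target is the user page on the source wiki; the label is the
    # stored name itself.
    return f"[[:{prefix}:User:{name}|{stored_name}]]"
```

Names without the ">" separator are treated as ordinary local users here and get no interwiki link.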

We should implement the external user name handling in the FileImporter. We could either set a prefix on the usernames that points to the profiles on the source wiki or, since we are in a CentralAuth world with SUL here, make use of existing accounts and create missing ones on the fly.

Lea_WMDE closed this task as Resolved. Wed, Apr 4, 10:03 AM