Transwiki (within a farm) support for Flow dumps/imports

Description

We have full export/import support for Flow, but it was designed more for the right to fork (third-party wikis being able to restore the whole wiki) and offline analysis (research, etc.).

We did not consider the use case of a transwiki within a farm, and there is an issue for WMF (and any other farm that used $wgFlowCluster like WMF). All of the workflows are in the same database (except private wikis), and the import process preserves UUIDs. That means when you import a board that's already somewhere on the cluster, it will cause ID collisions.

I think the solution is actually not as hard as it sounds, though:

  • Add centraluserid to the export if this is an attached wiki (I forget how to check that).
  • Add a FlowBackupReader subclass of BackupReader (currently we use the standard core import script/class). Add a --transwiki option, which means that this is a transwiki (copy) between two wikis on the same farm (meaning CentralIdLookup works).
  • In transwiki mode, you don't import the corresponding core pages to the Flow board you're transwiki-ing. Instead, the transwiki mode creates them for you.
  • Whenever an ID is encountered, it is mapped to a new equivalent ID using HistoricalUIDGenerator. This mapping is preserved and reused.
  • In handleBoard and handleTopic, we also create the board and topic core pages using ensureFlowRevision (the same way we normally create them). (In the current importer, we only do this for boards, which is inconsistent. I don't think it's necessary for non-transwiki since we assume you're also doing a core import. If it were, we would also need it for topics.)
  • Map all the user names to local user IDs using CentralIdLookup.
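
As a rough illustration of the ID-mapping step in the list above, here is a minimal sketch in Python (the real importer is PHP). `IdRemapper` and `random_id` are hypothetical names, and `uuid4` is only a stand-in for whatever HistoricalUIDGenerator actually produces. The point is just that each source ID gets exactly one new ID, and later references to it reuse the same mapping.

```python
import uuid
from typing import Callable, Dict


class IdRemapper:
    """Maps each source ID to one freshly generated ID and remembers it."""

    def __init__(self, new_id: Callable[[str], str]) -> None:
        # new_id is a stand-in for the real generator (an assumption here).
        self._new_id = new_id
        self._mapping: Dict[str, str] = {}

    def remap(self, old_id: str) -> str:
        # Generate a replacement only the first time an ID is seen;
        # afterwards return the stored one so references stay consistent.
        if old_id not in self._mapping:
            self._mapping[old_id] = self._new_id(old_id)
        return self._mapping[old_id]


def random_id(_old_id: str) -> str:
    # Hypothetical generator: a random hex string, NOT the Flow UUID format.
    return uuid.uuid4().hex


remapper = IdRemapper(random_id)
board = remapper.remap("board-1")
topic = remapper.remap("topic-1")
# A post that points at its topic must end up with the same new topic ID.
post_topic_ref = remapper.remap("topic-1")
```

Keeping the map for the whole import run is what lets a post's reference to its topic (or a topic's reference to its board) still line up after everything has been renumbered.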

@Mattflaschen-WMF, could you confirm or correct my understanding of the different steps of your plan?

Add a FlowBackupReader subclass of BackupReader (currently we use the standard core import script/class). Add a --transwiki option, which means that this is a transwiki (copy) between two wikis on the same farm (meaning CentralIdLookup works).

We create our own Flow/maintenance/importDump.php and it's the entry point for transwiki imports?

In transwiki mode, you don't import the corresponding core pages to the Flow board you're transwiki-ing. Instead, the transwiki mode creates them for you.

If I understand correctly, in normal mode we don't create the core pages because we assume a core import is running at the same time, but in transwiki mode we're only importing Flow revisions, so we should create the core pages using ensureFlowRevision().

@Mattflaschen-WMF, could you confirm or correct my understanding of the different steps of your plan?

Add a FlowBackupReader subclass of BackupReader (currently we use the standard core import script/class). Add a --transwiki option, which means that this is a transwiki (copy) between two wikis on the same farm (meaning CentralIdLookup works).

We create our own Flow/maintenance/importDump.php and it's the entry point for transwiki imports?

It would be the entry point for all Flow imports, with a --transwiki option, and it would subclass BackupReader from core.

In transwiki mode, you don't import the corresponding core pages to the Flow board you're transwiki-ing. Instead, the transwiki mode creates them for you.

If I understand correctly, in normal mode we don't create the core pages because we assume a core import is running at the same time, but in transwiki mode we're only importing Flow revisions, so we should create the core pages using ensureFlowRevision().

Yep, that's what I meant.

I was thinking a little more about whether "I don't think it's necessary for non-transwiki since we assume you're also doing a core import. If it were, we would also need it for topics." is accurate (maybe it's done for topics by other code in the current importer flow). But I think it is probably right. I haven't double-checked, but I would expect the normal ensureFlowRevision logic (e.g. for adding a post) to be higher-level than the dump import logic, and thus inapplicable.

Anyway, for transwiki, it's important that both the board and topic core pages are created with the right workflow IDs (the ones produced by the HistoricalUIDGenerator mapping).

Change 337895 had a related patch set uploaded (by Sbisson):
Import dump: support importing a board that exist in the farm

https://gerrit.wikimedia.org/r/337895

What do we want to do with contributions from users that don't exist on the target wiki?

With the version I have now, they are imported, but the name of the missing user is simply not shown on the board, topic or history pages until that user is created locally.

In the following screenshot, LocalUser1 exists on the source wiki but not on the target wiki, because he has never logged in there.

However, his name and contributions do show up correctly after he logs in once on the target wiki. Here's the same screenshot after he's logged in.

Change 337895 merged by jenkins-bot:
Import dump: support importing a board that exist in the farm

https://gerrit.wikimedia.org/r/337895

Change 339116 had a related patch set uploaded (by Mattflaschen):
Import dump: support importing a board that exist in the farm

https://gerrit.wikimedia.org/r/339116

Change 339116 merged by jenkins-bot:
Import dump: support importing a board that exist in the farm

https://gerrit.wikimedia.org/r/339116

Mentioned in SAL (#wikimedia-operations) [2017-02-22T19:37:48Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.13/extensions/Flow: SWAT: [[gerrit:339116|Import dump: support importing a board that exist in the farm]] T154830 (duration: 00m 56s)

@Mattflaschen-WMF
Is it supposed to work for importing Flow talk boards between betalabs wikis and normal wikis? The import of non-Flow pages is successful.

Steps:

  1. On enwiki betalabs, on Special:Import, select testwiki as 'Source wiki'.
  2. In 'Source page', enter the name of a Flow board that does not exist on enwiki betalabs. The options 'Copy all history revisions for this page' and 'Import to default locations' are selected.
  3. The import is not successful, the following is displayed:

The page with the imported page name is created, but 'Topic unknown' is displayed.

Okay, first, this task is about importing within a farm (meaning Beta->Beta or production->production).

We can explore other stuff/check for regressions, but that's the primary scope of the task.

The Flow export must be done using Flow/maintenance/dumpBackup.php. If you're not importing to the same farm, I think you'll also need to do a core export (mediawiki/maintenance/dumpBackup.php). For both, you can specify a page list (which could be a single page), with the --pagelist option.

And we've mainly been using the command-line import script (mediawiki/maintenance/importDump.php). Special:Import should probably work with "Upload XML data", but I don't know if or how much it's been tested.

@Mattflaschen-WMF
Is it supposed to work for importing Flow talk boards between betalabs wikis and normal wikis?

How are you trying to import from normal wikis? I believe "Import from another wiki" refers to other wikis on the same farm.

But per above, you should be using Flow/maintenance/dumpBackup.php, not "Import from another wiki".