Page MenuHomePhabricator

Guarantee that FileImporter imported pages are discoverable after a failure
Closed, ResolvedPublic2 Estimated Story Points

Description

When the import fails, we're probably left with a corrupted file page, either having the text revisions but no file, or a partial history. This file may not be properly marked as produced by FileImporter, since we do that in the final or penultimate step. We should try to delete the bad page after a failed import, or otherwise mark for deletion if the importing user doesn't have the rights to delete.

If we tag every imported revision, it will be possible to detect incomplete imports, as well as build a comprehensive list of all edits made by FileImporter on the target wiki.

Acceptance criteria
[] Always delete all zombie file page revisions ( and related file revisions ) that were created by failed FileImporter after a failed imports

  • Alternatively add a template (or tag) to find all imports, and make it possible to mark zombie files for deletion

Demo instructions

  • Perform an import on the beta cluster.
  • Look at the imported's page's history, to verify that the earliest imported revision has a FileImporter tag.

Follow-up

  • Write a maintenance script which cleans up zombie revisions.

Event Timeline

As far as I know, we never produce any zombie files (IIRC that was what @Addshore said), but we can double check if it looks like the opposite?

This was something I noticed happening on the beta cluster, so I'm sure that it's possible. I should have documented better, though... because of the fact that the resulting files aren't tagged with FileImporter, it'll be difficult to find these in production.

So, the last state that this was in was:

https://github.com/wikimedia/mediawiki-extensions-FileImporter/blob/master/src/Services/Importer.php#L197-L200

		/**
		 * Text revisions should be added first. See T147451.
		 * This ensures that the page entry is created and if something fails it can thus be deleted.
		 */

In an ideal world we would be able to rollback the cross storage writes as was tried in T147451.

This was something I noticed happening on the beta cluster, so I'm sure that it's possible. I should have documented better, though... because of the fact that the resulting files aren't tagged with FileImporter, it'll be difficult to find these in production.

Perhaps the individual revisions that are imported could have a tag added to them?
Even some server side logging would probably help a bit here

A few thoughts that came up during discussion:

  • Could create an artificial first revision with content like [[Category:FileImporter import in progress]], and a page property like "FileImporter in progress". Then we import revisions on top of this. When finished successfully, remove the page prop. We don't know what the implications might be of having a first revision timestamp later than the successive revisions.
  • Could import to a private namespace with restrictive permissions, then move to the File namespace when complete.

Side topic that should be considered:

  • We already have a race condition where users might edit a file during a long import process, and their edits would be overwritten.
awight set the point value for this task to 8.Aug 21 2019, 1:32 PM
awight renamed this task from Remove zombie files after failed import to Guarantee that FileImporter imported pages are discoverable after a failure.Aug 22 2019, 9:16 AM

I tried to use change tags to mark each imported text revision, but support is lacking to pass these through during revision creation. With some mw-core changes to ImportableOldRevisionImporter and ImportableOldRevision we would be able to do this.

Change 531693 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/FileImporter@master] [WIP] Add change tags to all FileImport text revisions

https://gerrit.wikimedia.org/r/531693

Change 531694 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/core@master] Include change tags in revision import structure

https://gerrit.wikimedia.org/r/531694

awight changed the point value for this task from 8 to 5.Sep 3 2019, 12:23 PM
awight changed the point value for this task from 5 to 2.Sep 3 2019, 12:26 PM

The acceptance criteria might need to change for this task. By definition, we can't reliably mark truncated imports *during* a failure, but only afterwards. This would be the maintenance script which can wait until future work.

Proposed goal:

  • Mark imported revisions in a way that we can find failed imports later.

Change 531694 merged by jenkins-bot:
[mediawiki/core@master] Include change tags in revision import structure

https://gerrit.wikimedia.org/r/531694

Change 534620 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/core@master] Rollback change_tag table for tests

https://gerrit.wikimedia.org/r/534620

Change 534620 merged by jenkins-bot:
[mediawiki/core@master] Rollback change_tag table for tests

https://gerrit.wikimedia.org/r/534620

I've been gaslighting myself, pointing out that the orphaned file may have been entirely imagined. But this may be the "smoking gun" after all: T227780#5324626 . Here's the broken file and its history, which shows that all file revisions are missing and there is no fileimporter tag present.

There might be a much easier way to get 80% of what we need. @Andrew-WMDE pointed out that we always import text revisions before file revisions, so if the import breaks before any file revisions have been copied (which is true for the one example file I found above), the following query will identify the partial imports:

select page_title
from page left
join image
  on img_name=page_title
where
  img_name is null
  and page_namespace=6;

On beta commons, this finds 240 broken images, for example: https://commons.wikimedia.beta.wmflabs.org/wiki/File:File:File:Art.jpeg

When the import breaks after file revisions have been added, I think we might still need the atomic tagging patch.

I'm still bisecting the problem, but I've found that removing logging from the $tablesUsed assignment in SpecialCheckUserTest causes our tests to pass.

To investigate why this breaks our test, I've tried the following, none of which fix the issue:

  • ChangeTags::defineTag( 'fileimporter-imported' ); in FileImporter's ImporterTest.
  • Alter core to rollback change_tag whenever logging is present in $tablesUsed, like we do for page-related and user-related tables.
  • Move $tablesUsed assignment from SpecialCheckUserTest's constructor into more common setUp function.
  • Removing all tests from SpecialCheckUserTest, leaving just the $tablesUsed assignment.

Change 536215 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/core@master] Reset all logging tables together

https://gerrit.wikimedia.org/r/536215

Change 536215 merged by jenkins-bot:
[mediawiki/core@master] Reset all logging tables together

https://gerrit.wikimedia.org/r/536215

This needs to be deployed Thursday, 19 Sept in order to burn-in before the default deployment.

Note what the patch https://gerrit.wikimedia.org/r/531693 does in addition is importing tags for all text revisions as they have been present on the source wiki. For example, bot edits that have been marked as such on the source wiki will still be marked as such on Wikimedia Commons. This was not done before. Personally, I think this is an improvement. I'm just not sure if not importing tags was a deliberate decision back then? Or if it just wasn't possible at the time? (It actually wasn't, @awight's efforts made it possible now!) Does anyone remember? @Tobi_WMDE_SW? @JStrodt_WMDE?

I'm just not sure if not importing tags was a deliberate decision back then?

One problem I hadn't thought of is that, even if the tags are successfully attached to imported revisions, they will certainly be missing the descriptive text shown on Special:Tags, and the translatable short code. In fact, they can't be displayed because of this! For example, this diff actually has two change-tags:

https://commons.wikimedia.org/wiki/?diff=182510496

[commonswiki]> select * from change_tag where ct_rev_id=182510496;
+---------+-----------+-----------+-----------+-----------+-----------+
| ct_id   | ct_rc_id  | ct_log_id | ct_rev_id | ct_params | ct_tag_id |
+---------+-----------+-----------+-----------+-----------+-----------+
| 2707929 | 189395690 | 146458961 | 182510496 | NULL      |        17 |
| 2707930 | 189395690 | 146458961 | 182510496 | NULL      |       154 |
+---------+-----------+-----------+-----------+-----------+-----------+

What I'd like to do is to split out this bit of the patch into a follow-up.

Change 531693 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@master] Add change tags to all FileImport text revisions

https://gerrit.wikimedia.org/r/531693

Okay, I see this in code review:

It might be that we want to drop this again later, if it turns out this creates more issues than it solves. I think the chance for this to happen is rare, and the benefits are worth the risk. Let's do it.

I agree with this assessment. At least, this way we're preserving some data even if it won't render yet.

this way we're preserving some data even if it won't render yet.

👍

Change 538566 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/FileImporter@wmf/1.34.0-wmf.23] Add change tags to all FileImport text revisions

https://gerrit.wikimedia.org/r/538566

Change 538566 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@wmf/1.34.0-wmf.23] Add change tags to all FileImport text revisions

https://gerrit.wikimedia.org/r/538566

Mentioned in SAL (#wikimedia-operations) [2019-09-23T11:31:50Z] <awight@deploy1001> Synchronized php-1.34.0-wmf.23/extensions/FileImporter: SWAT: [[gerrit:538566|Add change tags to all FileImport text revisions (T227849)]] (duration: 00m 57s)

@JStrodt_WMDE I noticed something that might be worth adding to the documentation, that "Imported with FileImporter" revisions will never appear in Special:RecentChanges because they're historical and therefore not considered "recent". Minor but possibly surprising.

@awight Thanks. So would that mean that RecentChanges will only allow filtering for "Modified with FileImporter"?