
XmlDumpWriter::openPage handles main namespace articles with prefixes that are namespace names AND are redirects incorrectly
Open, High, Public

Description

This breaks stubs dumps for a number of wikis.

A tale of two pages:
Page 1 is 8821 on sa wikisource, and here is its info:

wikiadmin@10.64.48.35(sawikisource)> select page_id, page_namespace, page_title, page_latest, page_is_redirect from page where page_id = 8821;
+---------+----------------+--------------------------------------------+-------------+------------------+
| page_id | page_namespace | page_title                                 | page_latest | page_is_redirect |
+---------+----------------+--------------------------------------------+-------------+------------------+
|    8821 |            104 | Kumarasambhavam_-_Mallinatha_-_1888.djvu/5 |      157361 |                0 |
+---------+----------------+--------------------------------------------+-------------+------------------+

Page 2 is 8829 on sa wikisource and here is its info:

wikiadmin@10.64.48.35(sawikisource)> select page_id, page_namespace, page_title, page_latest, page_is_redirect from page where page_id = 8829;
+---------+----------------+------------------------------------------------------------------+-------------+------------------+
| page_id | page_namespace | page_title                                                       | page_latest | page_is_redirect |
+---------+----------------+------------------------------------------------------------------+-------------+------------------+
|    8829 |              0 | पृष्ठम्:Kumarasambhavam_-_Mallinatha_-_1888.djvu/5               |       28418 |                1 |
+---------+----------------+------------------------------------------------------------------+-------------+------------------+
1 row in set (0.00 sec)

Note that the prefix पृष्ठम् is in fact the name of namespace 104; you can check it yourself by looking at https://sa.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cstatistics and JSON-decoding the response. (Or maybe there's a faster way.)
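For example, here is one way to do that JSON decoding in Python. The sample payload below is a hand-trimmed stand-in shaped like the real siteinfo response, not actual API output:

```python
import json

# Hand-trimmed sample shaped like the siteinfo API response (format=json);
# the real response contains many more namespaces and fields.
sample = json.loads("""
{
  "query": {
    "namespaces": {
      "0":   { "id": 0,   "*": "" },
      "104": { "id": 104, "*": "पृष्ठम्", "canonical": "Page" }
    }
  }
}
""")

def namespace_name(siteinfo, ns_id):
    """Return the localized name of a namespace from a siteinfo response."""
    return siteinfo["query"]["namespaces"][str(ns_id)]["*"]

print(namespace_name(sample, 104))  # the local name of namespace 104
```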

When we dump the stubs, we wind up grabbing a number of pages in a batch rather than asking the db for each one separately. Speed and all that. In the current case we ask for a batch that includes the range 8821 through 8829; these pages all have very few revisions, so it's no burden on the servers. BUT...

We process all the revisions up to the first one for page 8829. Items go into the link cache during processing.

Now we start work on page 8829:

  • openPage makes a Title from the selected row.
  • If the page is a redirect, the writer gets a WikiPage object for the Title object and calls that object's getRedirectTarget().
  • This method first checks to see if the page is a redirect, logical enough: if ( !$this->mTitle->isRedirect() ) {
  • Title->isRedirect() calls getArticleID() on the Title object, with an argument of 0.
  • Because at this point the article ID has not been set in the Title object, we wind up at $this->mArticleID = $linkCache->addLinkObj( $this );
  • We're going to look up the info in the link cache. Guess what the key is: पृष्ठम्:Kumarasambhavam_-_Mallinatha_-_1888.djvu/5
  • And guess what page id it has: 8821.

BOOM.

This gets caught in the constructor of RevisionStoreRecord.php, which gets passed a revision row for page id 8829 with a Title now claiming to belong to page id 8821.

[7381d9d77b7aef61403caffe] [no req]   InvalidArgumentException from line 100 of /srv/mediawiki_atg/php-1.33.0-wmf.23/includes/Revision/RevisionStoreRecord.php: The given Title does not belong to page ID 8829 but actually belongs to 8821
Backtrace:
#0 /srv/mediawiki_atg/php-1.33.0-wmf.23/includes/Revision/RevisionStore.php(1820): MediaWiki\Revision\RevisionStoreRecord->__construct(Title, User, CommentStoreComment, stdClass, MediaWiki\Revision\RevisionSlots, boolean)
#1 /srv/mediawiki_atg/php-1.33.0-wmf.23/includes/export/XmlDumpWriter.php(332): MediaWiki\Revision\RevisionStore->newRevisionFromRow(stdClass, integer, Title)
#2 /srv/mediawiki_atg/php-1.33.0-wmf.23/includes/export/WikiExporter.php(485): XmlDumpWriter->writeRevision(stdClass)
#3 /srv/mediawiki_atg/php-1.33.0-wmf.23/includes/export/WikiExporter.php(445): WikiExporter->outputPageStreamBatch(Wikimedia\Rdbms\ResultWrapper, stdClass)
#4 /srv/mediawiki_atg/php-1.33.0-wmf.23/includes/export/WikiExporter.php(269): WikiExporter->dumpPages(string, boolean)
#5 /srv/mediawiki_atg/php-1.33.0-wmf.23/includes/export/WikiExporter.php(154): WikiExporter->dumpFrom(string, boolean)
#6 /srv/mediawiki_atg/php-1.33.0-wmf.23/maintenance/includes/BackupDumper.php(288): WikiExporter->pagesByRange(integer, integer, boolean)
#7 /srv/mediawiki_atg/php-1.33.0-wmf.23/maintenance/dumpBackup.php(81): BackupDumper->dump(integer, integer)
#8 /srv/mediawiki_atg/php-1.33.0-wmf.23/maintenance/doMaintenance.php(94): DumpBackup->execute()
#9 /srv/mediawiki_atg/php-1.33.0-wmf.23/maintenance/dumpBackup.php(138): require_once(string)
#10 /srv/mediawiki_atg/multiversion/MWScript.php(100): require_once(string)
#11 {main}

I have no idea what the right fix is.

Event Timeline

ArielGlenn created this task.

Note that in the 20190301 sa wikisource stubs, the revisions for pages 8821 and 8829 appear normally, with the revision for 8829 marked as a redirect, as it should be, and the title of the redirect target provided.

Since the underlying issue is that two pages have the same key in the link cache, there are a few things that might be done:

  • run the namespaceDupes.php maintenance script and rename the dead pages; this may not always produce the desired result, as those redirects will sit at page titles that no one would ever visit
  • delete the inaccessible pages manually - can this even be done, if the pages can't be reached?
  • for making the stubs work, XmlDumpWriter::closePage could remove the page it has just processed (in $this->currentTitle) from the link cache, if that's not very expensive; I'll look into this option today.
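The third option can be sketched as a toy Python simulation. This is not the actual PHP patch; the dict stands in for LinkCache, and the function name is illustrative only:

```python
# Toy sketch of the closePage workaround: evict the just-processed title
# from the link cache so a later page whose title text produces the same
# key cannot pick up the stale entry.

link_cache = {"पृष्ठम्:Kumarasambhavam_-_Mallinatha_-_1888.djvu/5": 8821}

def close_page(cache, current_title_key):
    """Drop the current page's cache entry, if present."""
    cache.pop(current_title_key, None)

close_page(link_cache, "पृष्ठम्:Kumarasambhavam_-_Mallinatha_-_1888.djvu/5")
print(link_cache)  # the colliding key is gone before page 8829 is read
```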

Change 502150 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] avoid link cache issues with duplicate title keys for xml dumps

https://gerrit.wikimedia.org/r/502150

I've tested this just now on the sa wikisource dumps and the stubs run past the problematic page. If they run to completion properly and the output looks good compared to the last month's full run, I'll rerun all of the problem wikis with this live-patched so we can get these jobs done.

A comparison of stubs for all revisions from 20190320 and the ones run just now shows only new revisions, aside from newly imported pages (which add up to the correct number of revisions for those) and a redirect change reflected in an edit during the last couple weeks.

I'll start running the other unhappy wikis' stubs so we can get them done.

There should probably be a separate bug for making LinkCache do the right thing. Your workaround looks like the right thing for this instance of the issue.

The other bug (T220424) is more or less that bug.

I have live-patched this on snapshot1007 for wmf.24, so that stubs of commonswiki (the one outstanding job left) can run to completion.

This also broke abstracts on en wiki, so I'm live-patching on snapshot1009 for that. Maybe we can get a backport before today's deploy.

Change 502150 merged by jenkins-bot:
[mediawiki/core@master] avoid link cache issues with duplicate title keys for xml dumps

https://gerrit.wikimedia.org/r/502150

Change 502538 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@wmf/1.33.0-wmf.24] avoid link cache issues with duplicate title keys for xml dumps

https://gerrit.wikimedia.org/r/502538

Change 502538 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.24] avoid link cache issues with duplicate title keys for xml dumps

https://gerrit.wikimedia.org/r/502538