undelete sometimes leaves the association between pages and their revisions in a strange state
Open, Needs Triage, Public

Description

Copying the relevant info over from T286877, since the impact on the dumps described in that task has been fixed but the underlying inconsistency has not been explained.

Event Timeline

Copied from T286877#7220197 to save people having to click around:

The page in question, "Stockholm Business School", was restored on July 18, 2021; see https://sv.wikipedia.org/w/index.php?title=Stockholm_Business_School&action=history&uselang=en
This is the entry in the page table for this title:

wikiadmin@10.64.0.99(svwiki)> select * from page where page_namespace=0 and page_title = "Stockholm_Business_School";
+---------+----------------+---------------------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| page_id | page_namespace | page_title                | page_restrictions | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+---------+----------------+---------------------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| 3872662 |              0 | Stockholm_Business_School |                   |                0 |           0 | 0.207856671786 | 20210718185812 | 20210718185812     |    49457568 |     7354 | wikitext           | NULL      |
+---------+----------------+---------------------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
1 row in set (0.001 sec)

However, there are still entries in the archive table for this same title, and their ar_page_id values (8579229 and 8579418) differ from the page_id above (3872662):

wikiadmin@10.64.0.99(svwiki)> select * from archive where ar_namespace=0 and  ar_title = "Stockholm_Business_School";
+---------+--------------+---------------------------+---------------+----------+----------------+---------------+-----------+------------+--------+------------+--------------+---------------------------------+
| ar_id   | ar_namespace | ar_title                  | ar_comment_id | ar_actor | ar_timestamp   | ar_minor_edit | ar_rev_id | ar_deleted | ar_len | ar_page_id | ar_parent_id | ar_sha1                         |
+---------+--------------+---------------------------+---------------+----------+----------------+---------------+-----------+------------+--------+------------+--------------+---------------------------------+
| 5215947 |            0 | Stockholm_Business_School |      23822663 |     1136 | 20210717092556 |             0 |  49453482 |          0 |     78 |    8579229 |            0 | i5aq79phzojjd2mn019iqikye1cns89 |
| 5217451 |            0 | Stockholm_Business_School |      23827934 |     1136 | 20210718100730 |             0 |  49456280 |          0 |     75 |    8579418 |            0 | ilxrsk5zyft2zwhht95kw234yzgxced |
+---------+--------------+---------------------------+---------------+----------+----------------+---------------+-----------+------------+--------+------------+--------------+---------------------------------+
2 rows in set (0.001 sec)
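
To look for other titles in the same state, a check along the following lines might help. This is only a sketch: it runs against an in-memory SQLite stand-in rather than the real replica, it keeps only the columns needed for the comparison, and the query name LINGERING_ARCHIVE_SQL is made up for illustration. The row values are the ones shown above. Note that archive rows for a live title are not necessarily wrong by themselves (a history can be restored only in part), so the result is a list of candidates to inspect, not a list of confirmed errors.

```python
import sqlite3

# In-memory stand-in for the wiki database (assumption: only the columns
# needed for this comparison are modelled; the real tables have many more).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    page_namespace INTEGER,
    page_title TEXT
);
CREATE TABLE archive (
    ar_id INTEGER PRIMARY KEY,
    ar_namespace INTEGER,
    ar_title TEXT,
    ar_rev_id INTEGER,
    ar_page_id INTEGER
);
""")

# Values copied from the query output in this task.
conn.execute("INSERT INTO page VALUES (3872662, 0, 'Stockholm_Business_School')")
conn.executemany(
    "INSERT INTO archive VALUES (?, ?, ?, ?, ?)",
    [
        (5215947, 0, "Stockholm_Business_School", 49453482, 8579229),
        (5217451, 0, "Stockholm_Business_School", 49456280, 8579418),
    ],
)

# Archive rows whose title matches a live page but whose ar_page_id
# points at a different page id than the live page.
LINGERING_ARCHIVE_SQL = """
SELECT ar_id, ar_rev_id, ar_page_id, page_id
FROM archive
JOIN page ON page_namespace = ar_namespace AND page_title = ar_title
WHERE ar_page_id <> page_id
ORDER BY ar_id
"""

lingering = conn.execute(LINGERING_ARCHIVE_SQL).fetchall()
for row in lingering:
    print(row)
```

With the two rows above, both archive entries are reported, each paired with the live page_id 3872662.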

When the dump code iterates over the revisions, it must pick the wrong page id, because this is what XmlDumpWriter wrote out:

--snip--
...
  <page>
    <title>Dubai Millennium</title>
    <ns>0</ns>
    <id>8579417</id>
    <revision>
      <id>49456257</id>
      <timestamp>2021-07-18T09:59:18Z</timestamp>
      <contributor>
        <username>Kuriosatempel</username>
        <id>111748</id>
      </contributor>
      <comment>[[Wikipedia:Automatisk sammanfattning|←]]Skapade sidan med '{{Häst | namn = Dubai Millennium | bild =  | bildtext =  | kön = [[Hingst]] | född = {{Sportdatum|1996|3|2}} | födelseland = Storbritannien | död = {{Hästdöd datum och ålder|2001|04|29|1996|03|02}} | död_land = Storbritannien | färg = [[Hästfärg|Brun]] | tecken =  | ras = [[Engelskt fullblod]] | sport = Galoppsport | aktiv = 1998–2000 | efter = [[Seeking The Gold]] | under = [[Colorado Dancer]] | underefter = Sha...'</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="10912" id="49550759" />
      <sha1>p0stf8dyt26b3ipsp171jtwt7smlzkh</sha1>
    </revision>
  </page>
  <page>
    <title>Stockholm Business School</title>
    <ns>0</ns>
    <id>8579418</id>

That page id isn't in the page table, of course:

wikiadmin@10.64.0.99(svwiki)> select * from page where page_id=8579418;
Empty set (0.001 sec)
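
The same cross-check can be run over a dump file rather than one id at a time. The sketch below is illustrative only: the function name missing_page_ids is made up, the page table is an in-memory SQLite stand-in holding just the row shown earlier, and the XML is a simplified, closed-off version of the excerpt above (real dumps declare an XML namespace on the root element, which would need to be handled).

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified stand-in for the dump excerpt above (assumption: no XML
# namespace, and the truncated <page> element is closed so it parses).
DUMP_FRAGMENT = """
<mediawiki>
  <page>
    <title>Stockholm Business School</title>
    <ns>0</ns>
    <id>8579418</id>
  </page>
</mediawiki>
"""

# In-memory stand-in for the page table, with the one row from this task.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
conn.execute("INSERT INTO page VALUES (3872662, 'Stockholm_Business_School')")


def missing_page_ids(dump_xml, db):
    """Return (page id, title) for dumped pages whose id is not in the page table."""
    missing = []
    for page in ET.fromstring(dump_xml).iter("page"):
        # findtext("id") only looks at direct children, so revision ids are ignored.
        page_id = int(page.findtext("id"))
        title = page.findtext("title")
        found = db.execute(
            "SELECT 1 FROM page WHERE page_id = ?", (page_id,)
        ).fetchone()
        if found is None:
            missing.append((page_id, title))
    return missing


missing = missing_page_ids(DUMP_FRAGMENT, conn)
print(missing)
```

For the fragment above this reports page id 8579418 as missing, matching the empty result of the manual query.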

Naturally nothing good is going to come from that. So my question is, why are those rows still in the archive table after page restoration, and how does the page id from those rows wind up being the unlucky id chosen for use?