
Not all content is getting replicated to wikitech-static
Closed, Resolved (Public)

Description

The SAL (Server Admin Log) is up to date, so syncing must be happening, and yet not all content is there. Compare

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS

with

https://wikitech-static.wikimedia.org/wiki/Portal:Cloud_VPS

We must not be making a complete dump.

Event Timeline

Wikitech is dumped using

/usr/local/bin/mwscript maintenance/dumpBackup.php labswiki --current --uploads

My guess is that --current is a bit broken or doesn't do what I expected; maybe it doesn't detect moved pages properly.

A --current dump is 8.6M and a --full dump is 7.2G, so doing --full may not be practical.

The --current flag ends up adding this condition to the query to gather content:

$join['revision'] = [ 'INNER JOIN', 'page_id=rev_page AND page_latest=rev_id' ];
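To see what that join condition selects, here is a minimal sketch using an in-memory SQLite database. The two tables are cut down to a few columns and the sample rows are invented for illustration; this is not the real MediaWiki schema or the labswiki data. It only shows that joining on page_id=rev_page AND page_latest=rev_id keeps one row per page, the latest revision.

```python
import sqlite3

# Toy stand-ins for the MediaWiki `page` and `revision` tables.
# Column subset and sample rows are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT, page_latest INTEGER);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_comment TEXT);
INSERT INTO page VALUES (1, 'Example', 11);
INSERT INTO revision VALUES (10, 1, 'first draft');
INSERT INTO revision VALUES (11, 1, 'latest edit');
""")


def current_revisions(db):
    """Return (title, rev_id, comment) for the latest revision of each page."""
    return db.execute(
        "SELECT page_title, rev_id, rev_comment FROM page "
        "INNER JOIN revision ON (page_id = rev_page AND page_latest = rev_id) "
        "ORDER BY page_id ASC"
    ).fetchall()


print(current_revisions(conn))
```

The older revision (rev_id 10) is dropped because it does not match page_latest, so a page that exists and has a valid page_latest should always show up in a --current dump.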

The full SQL query generated looks something like this (from my local mw-vagrant wiki):

SELECT /* WikiExporter::dumpFrom www-data@mediaw... */  /*! STRAIGHT_JOIN */ *,
  rev_comment AS `rev_comment_text`,
  NULL AS `rev_comment_data`,
  NULL AS `rev_comment_cid`
FROM `page` FORCE INDEX (PRIMARY)
INNER JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id))
INNER JOIN `text` ON ((rev_text_id=old_id))
ORDER BY page_id ASC

In the labswiki db, Portal:Toolforge has page_id = 46350. The query above returns data for that page_id:

(wikiadmin@silver) [labswiki]> SELECT /* WikiExporter::dumpFrom www-data@mediaw... */  /*! STRAIGHT_JOIN */ *,
    ->   rev_comment AS `rev_comment_text`,
    ->   NULL AS `rev_comment_data`,
    ->   NULL AS `rev_comment_cid`
    -> FROM `page` FORCE INDEX (PRIMARY)
    -> INNER JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id))
    -> INNER JOIN `text` ON ((rev_text_id=old_id))
    -> WHERE page_id = 46350
    -> ORDER BY page_id ASC\G
*************************** 1. row ***************************
           page_id: 46350
    page_namespace: 0
        page_title: Portal:Toolforge
 page_restrictions:
      page_counter: 0
  page_is_redirect: 0
       page_is_new: 0
       page_random: 0.455464647722
      page_touched: 20170928225244
       page_latest: 1771421
          page_len: 1098
page_content_model: wikitext
page_links_updated: 20170928225302
         page_lang: NULL
            rev_id: 1771421
          rev_page: 46350
       rev_text_id: 1768371
       rev_comment: tool-labs -> toolforge
          rev_user: 1604
     rev_user_text: BryanDavis
     rev_timestamp: 20170928225244
    rev_minor_edit: 0
       rev_deleted: 0
           rev_len: 1098
     rev_parent_id: 1767997
          rev_sha1: sircb7sin9e6wc97zgcph2oqq6acj5y
rev_content_format: NULL
 rev_content_model: NULL
            old_id: 1768371
          old_text: [gzip-compressed binary data omitted]
         old_flags: utf-8,gzip
  rev_comment_text: tool-labs -> toolforge
  rev_comment_data: NULL
   rev_comment_cid: NULL
1 row in set (0.00 sec)

That revision is on wikitech (https://wikitech.wikimedia.org/w/index.php?title=Portal:Toolforge&oldid=1771421), but it is not on wikitech-static (https://wikitech-static.wikimedia.org/w/index.php?title=Portal:Toolforge&oldid=1771421).

I made a dump to see if the page content is in the dump file:

$ mwscript maintenance/dumpBackup.php labswiki --current > test-dump.xml
$ ls -alh test-dump.xml
-rw-rw-r-- 1 bd808 wikidev 39M Oct  5 17:01 test-dump.xml
$ grep '<title>Portal:Toolforge' test-dump.xml
    <title>Portal:Toolforge/Admin/Archive</title>
    <title>Portal:Toolforge/Admin/replagstats</title>
    <title>Portal:Toolforge/Admin/toolhistory</title>
    <title>Portal:Toolforge/Admin</title>
    <title>Portal:Toolforge/Admin/emergency guides/single tool webservice</title>
    <title>Portal:Toolforge/Admin/emergency guides/irc bot deployment</title>
    <title>Portal:Toolforge/Admin/emergency guides</title>
    <title>Portal:Toolforge/Admin/emergency guides/toolforge down notification</title>
    <title>Portal:Toolforge/Admin/new exec host</title>
    <title>Portal:Toolforge</title>
    <title>Portal:Toolforge/Admin/Deploy new jobutils package</title>
    <title>Portal:Toolforge/Admin/Kubernetes</title>
    <title>Portal:Toolforge/Admin/local packages</title>
    <title>Portal:Toolforge/Admin/BotLicensing</title>
    <title>Portal:Toolforge/Admin/emergency guides/labs down notification</title>
    <title>Portal:Toolforge/Admin/</title>
    <title>Portal:Toolforge/Admin/Exim</title>
  <page>
    <title>Portal:Toolforge</title>
    <ns>0</ns>
    <id>46350</id>
    <revision>
      <id>1771421</id>
      <parentid>1767997</parentid>
      <timestamp>2017-09-28T22:52:44Z</timestamp>
      <contributor>
        <username>BryanDavis</username>
        <id>1604</id>
      </contributor>
      <comment>tool-labs -&gt; toolforge</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve" bytes="1098">{{Template:Toolforge nav}}
{{...snip...}}
&lt;noinclude&gt;__NOTOC__ [[Category:Portals|Toolforge]]&lt;/noinclude&gt;</text>
      <sha1>sircb7sin9e6wc97zgcph2oqq6acj5y</sha1>
    </revision>
  </page>

The dump does contain the revision, so the problem appears to be on the wikitech-static import side.

MWUnknownContentModelException from line 306 of /srv/mediawiki/w/includes/content/ContentHandler.php: The content model 'yaml' is not registered on this wiki.

The problem is that the dump contains <model>yaml</model> pages and wikitech-static does not have the OpenStackManager extension installed to provide a ContentHandler for that model.

All the pages that come after the first <model>yaml</model> page in the dump are skipped during the import.
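One way to spot this kind of problem before importing is to scan the dump for content models the target wiki does not know about. The following is a rough sketch, not part of the actual sync tooling: the function name, the sample XML, and the set of registered models are assumptions (the set lists the models I believe MediaWiki core ships with; a real wiki may register more).

```python
import io
import xml.etree.ElementTree as ET

# Assumption: content models available on the importing wiki.
REGISTERED_MODELS = {"wikitext", "javascript", "css", "text", "json"}

# Tiny invented dump fragment; real dumps carry an XML namespace and more fields.
SAMPLE_DUMP = b"""<mediawiki>
  <page><title>Portal:Toolforge</title>
    <revision><model>wikitext</model></revision></page>
  <page><title>Hiera:Deployment-prep</title>
    <revision><model>yaml</model></revision></page>
</mediawiki>"""


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def unregistered_models(stream, registered=REGISTERED_MODELS):
    """Return (title, model) pairs for pages whose model is not registered."""
    hits = []
    for _event, page in ET.iterparse(stream, events=("end",)):
        if local(page.tag) != "page":
            continue
        title = None
        for node in page.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "model" and node.text not in registered:
                hits.append((title, node.text))
        page.clear()  # free memory; real dumps can be large
    return hits


print(unregistered_models(io.BytesIO(SAMPLE_DUMP)))
```

Running something like this against the real dump file (opened in binary mode) would list every page that could trigger the MWUnknownContentModelException on import.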

Change 382520 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wikitech-static: convert yaml content to wikitext when dumping wikitech

https://gerrit.wikimedia.org/r/382520

Change 382520 merged by Andrew Bogott:
[operations/puppet@production] wikitech-static: convert yaml content to wikitext when dumping wikitech

https://gerrit.wikimedia.org/r/382520

I ran the export and import by hand just now, and I think we're getting the complete wiki.

Looks good to me. The root problem was two missing content handlers. One comes from OpenStackManager, which we don't really want to install, so the dump process now marks pages like https://wikitech-static.wikimedia.org/wiki/Hiera:Deployment-prep as wikitext instead of yaml. The other was due to an outdated version of TemplateStyles; this was fixed by updating wikitech-static to the REL1_30 MediaWiki release, which is currently in pre-release beta testing.
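For illustration, relabelling yaml pages as wikitext can be done with a simple line filter over the dump stream. This is only a sketch of the idea and not the content of the merged puppet change, which I have not reproduced here; the function name is hypothetical, and it only touches the <model> element.

```python
def yaml_to_wikitext(lines):
    """Yield dump lines with <model>yaml</model> relabelled as wikitext.

    Page text in the dump has '<' escaped as '&lt;', so a literal
    '<model>yaml</model>' should only match the real model element.
    """
    for line in lines:
        yield line.replace("<model>yaml</model>", "<model>wikitext</model>")


sample = [
    "      <model>yaml</model>\n",
    "      <model>wikitext</model>\n",
    "      <text>&lt;model&gt;yaml&lt;/model&gt;</text>\n",
]
for out in yaml_to_wikitext(sample):
    print(out, end="")
```

In practice the filter could read from sys.stdin and write to sys.stdout so it sits between dumpBackup.php and the file that gets shipped to wikitech-static.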