Contributor ID field has empty instances in 2019-05-01 dumps (was 0 in previous month)
Closed, Resolved · Public

Description

This month's job converting the XML history dumps to Parquet on Hadoop failed because of a format issue.
Investigation has shown that some revisions have an empty XML element for contributor.id, <id />, whereas they had <id>0</id> in the previous dump version (tested on zhwikisource and frwikisource).
Quantification:

  • 2019-05 zhwikisource empty ids: 10311
  • 2019-04 zhwikisource empty ids: 0
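
A count like the one above could be reproduced with a small streaming parser. This is only a sketch, not the script that produced these numbers: the helper names and the inline sample are hypothetical, and it assumes the dump uses the export-0.10 layout shown in this task.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_empty_contributor_ids(stream):
    """Count <contributor> elements whose <id> child is present but empty."""
    empty = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "contributor":
            for child in elem:
                # Only the contributor's own <id> is inspected, not the revision id.
                if local_name(child.tag) == "id" and not (child.text or "").strip():
                    empty += 1
        elif name == "page":
            elem.clear()  # drop finished pages to keep memory use bounded
    return empty


# Hypothetical miniature dump: one revision with <id />, one with a real id.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <revision>
      <id>209484</id>
      <contributor><username /><id /></contributor>
    </revision>
    <revision>
      <id>209485</id>
      <contributor><username>Wmrwiki</username><id>2632</id></contributor>
    </revision>
  </page>
</mediawiki>"""

print(count_empty_contributor_ids(io.BytesIO(SAMPLE)))
```

For a real dump file, the same function can be fed a stream from gzip.open(path, "rb") instead of the in-memory sample.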

Example of revision with empty id from 2019-05 zhwikisource:

<revision>
  <id>209484</id>
  <timestamp>2001-01-15T00:00:00Z</timestamp>
  <contributor>
    <username />
    <id />
  </contributor>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text xml:space="preserve">{{copy|http://open-lit.com}}</text>
  <sha1>91nyffwsl71ubxm0dji7u1bljr5d8fl</sha1>
</revision>

Same revision on 2019-04 zhwikisource:

<revision>
   <id>209484</id>
   <timestamp>2001-01-15T00:00:00Z</timestamp>
   <contributor>
     <username />
     <id>0</id>
   </contributor>
   <model>wikitext</model>
   <format>text/x-wiki</format>
   <text xml:space="preserve">{{copy|http://open-lit.com}}</text>
   <sha1>91nyffwsl71ubxm0dji7u1bljr5d8fl</sha1>
 </revision>

Event Timeline

Filling in some more details.

The same is visible in the stubs file. Here's the relevant excerpt from zhwikisource-20190420-stub-meta-history.xml.gz:

  <page>
    <title>西遊記</title>
    <ns>0</ns>
    <id>305</id>
...
    <revision>
      <id>193892</id>
      <parentid>168355</parentid>
      <timestamp>2009-02-13T04:19:35Z</timestamp>
      <contributor>
        <username>Wmrwiki</username>
        <id>2632</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="176668" bytes="8134" />
      <sha1>353jnwdvy9hx51y8sx2wh81calrs2pe</sha1>
    </revision>
    <revision>
      <id>209484</id>
      <timestamp>2001-01-15T00:00:00Z</timestamp>
      <contributor>
        <username />
        <id />
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="188168" bytes="28" />
      <sha1>91nyffwsl71ubxm0dji7u1bljr5d8fl</sha1>
    </revision>
    <revision>
      <id>209485</id>
      <parentid>193892</parentid>
      <timestamp>2009-03-25T12:57:56Z</timestamp>
      <contributor>
        <username>Wmrwiki</username>
        <id>2632</id>
      </contributor>
...

Note the bogus timestamp as well.

From zhwikisource-20190401-stub-meta-history.xml.gz:

<revision>
  <id>209484</id>
  <timestamp>2001-01-15T00:00:00Z</timestamp>
  <contributor>
    <username />
    <id>0</id>
  </contributor>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text id="188168" bytes="28" />
  <sha1>91nyffwsl71ubxm0dji7u1bljr5d8fl</sha1>
</revision>

The same bogus timestamp but the id field is rendered as 0.

The entry in the database, for completeness' sake:

wikiadmin@10.64.16.191(zhwikisource)> select * from revision where rev_id = 209484;
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
| rev_id | rev_page | rev_text_id | rev_comment | rev_user | rev_user_text | rev_timestamp  | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1                        | rev_content_model | rev_content_format |
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
| 209484 |      305 |      188168 |             |        0 |               | 20010115000000 |              0 |           0 |      28 |             0 | 91nyffwsl71ubxm0dji7u1bljr5d8fl | NULL              | NULL               |
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
1 row in set (0.00 sec)

rev_user is clearly 0 there.

Now a look at the actor tables:

wikiadmin@10.64.16.191(zhwikisource)> select * from revision_actor_temp where revactor_rev = 209484;
+--------------+----------------+--------------------+---------------+
| revactor_rev | revactor_actor | revactor_timestamp | revactor_page |
+--------------+----------------+--------------------+---------------+
|       209484 |          72325 | 20010115000000     |           305 |
+--------------+----------------+--------------------+---------------+
1 row in set (0.00 sec)

wikiadmin@10.64.16.191(zhwikisource)> select * from actor where actor_id = 72325;
+----------+------------+------------+
| actor_id | actor_user | actor_name |
+----------+------------+------------+
|    72325 |       NULL |            |
+----------+------------+------------+
1 row in set (0.00 sec)

Well that's a bit inconvenient.

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. · May 23 2019, 2:04 PM

Both files were generated with the same schema version (0.10):

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="zh">

Time of the Apr 20th file:

-rw-r--r-- 1 dumpsgen dumpsgen       89 Apr 21 00:48 sha1sums-zhwikisource-20190420-stub-meta-history.xml.gz.txt

So it's most likely a config change that did us in.

Here it is, merged Apr 15: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/504011/

This means that we get the value out of the actor table indeed, see getJoin(), line 153 here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/ActorMigration.php

Since that's null, XmlDumpWriter::writeContributor receives a null $id and calls Xml::element( 'id', null, strval( $id ) ); strval( null ) is the empty string.
Here's the relevant part of Xml::element:

if ( is_null( $contents ) ) {
        $out .= '>';
} elseif ( $allowShortTag && $contents === '' ) {
        $out .= ' />';
} else {
        $out .= '>' . htmlspecialchars( $contents, ENT_NOQUOTES ) . "</$element>";
}

Since we now get here with an empty string instead of '0' for the id, the second branch emits the short tag <id /> instead of <id>0</id>.
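
To make the branch behaviour concrete, here is a rough Python model of the Xml::element excerpt quoted above. It is an illustration only: xml_element and php_strval are hypothetical stand-ins, attribute handling is omitted, and php_strval only approximates PHP's strval() for the null and integer cases that matter here.

```python
from html import escape


def xml_element(element, contents="", allow_short_tag=True):
    """Rough model of the quoted Xml::element branches (attributes omitted)."""
    out = "<" + element
    if contents is None:
        out += ">"  # open tag only, caller closes it
    elif allow_short_tag and contents == "":
        out += " />"  # empty string gives the short tag
    else:
        # escape(..., quote=False) escapes &, < and >, like ENT_NOQUOTES
        out += ">" + escape(contents, quote=False) + "</" + element + ">"
    return out


def php_strval(value):
    """Approximation of PHP strval() for null and int inputs only."""
    return "" if value is None else str(value)


print(xml_element("id", php_strval(None)))  # id coming from actor_user = NULL
print(xml_element("id", php_strval(0)))     # id coming from rev_user = 0
```

The first call takes the short-tag branch and the second takes the final branch, which is the difference seen between the two dump runs.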

Pinging @Anomie to see what the cleverest way out of this issue might be.

Xml::element( 'id', null, strval( $id ?: 0 ) ) seems most straightforward if you want to continue to have <id>0</id> in the XML when there's no user ID.

Or you could adjust the callers similarly (like $this->writeContributor( $row->xx_user ?: 0, $row->xx_user_text )), or find where the query is built and adjust the SQL to do COALESCE( xx_user, 0 ). We can't have ActorMigration itself do the COALESCE() though, since that will cause various other queries to be unable to properly use indexes (similar to the situation from T221339#5191177, where a WMCS view had a similar coalesce).

I think fixing up writeContributor is the safest bet; it changes only that specific behavior without other side effects, without impacting the query itself. Thanks!
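
The effect of normalizing the id inside writeContributor, along the lines of strval( $id ?: 0 ), can be sketched in Python as follows. This is not the merged patch: the function names are hypothetical, and the explicit membership test stands in for PHP's ?: operator, which treats null, 0, '' and '0' as falsy.

```python
def contributor_id_text(user_id):
    """Mimic strval( $id ?: 0 ): any missing or zero id is written as '0'."""
    # PHP's ?: falls through to 0 for null, 0, '' and '0'; list them explicitly,
    # because Python's own truthiness rules differ (e.g. "0" is truthy).
    if user_id in (None, "", 0, "0"):
        return "0"
    return str(user_id)


def write_contributor_id(user_id):
    """Render the <id> element for a contributor, never as an empty tag."""
    return "<id>" + contributor_id_text(user_id) + "</id>"


print(write_contributor_id(None))  # actor_user is NULL
print(write_contributor_id(0))     # legacy rev_user = 0
print(write_contributor_id(2632))  # regular registered user
```

Keeping the normalization at the point where the element is written leaves the query, and therefore index usage, untouched, which is the reason given above for preferring this option.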

Change 512672 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] make sure revision uids are 0 in the xml if missing/0 in the db

https://gerrit.wikimedia.org/r/512672

I have tested the above patch with the full stubs dump from zhwikisource and it writes out <id>0</id> everywhere that the last run had <id /> without changing anything else, so it's ready for review/merge.

daniel added a subscriber: daniel. · May 28 2019, 11:59 AM

So, the root cause is that non-existing users are represented by actor_user = NULL instead of actor_user = 0?
NULL is semantically nicer, but I fear this inconsistency is going to bite us in a lot of places now... Any code that is migrated from using rev_user to using actor will have to know about this special case...

I don't know what else in MW relies on having a 0 uid instead of NULL, but the dumps certainly do.

Change 512672 merged by jenkins-bot:
[mediawiki/core@master] make sure revision uids are 0 in the xml if missing/0 in the db

https://gerrit.wikimedia.org/r/512672

The current run will still have the uid issue with the stubs. However, if all goes well with the train this week, the next run on the 20th should have the issue fixed.

ArielGlenn closed this task as Resolved. · Jun 24 2019, 6:00 AM

The current run shows no bad uids for zhwikisource stubs so this is done.