Contributor ID field has empty instances in 2019-05-01 dumps (was 0 in previous month)
Closed, ResolvedPublic

Assigned To

Authored By

	JAllemandou
	May 23 2019, 1:09 PM

Description

This month's job converting XML history dumps to Parquet on Hadoop failed because of a format issue.
Investigation has shown that some revisions have an empty XML element as contributor.id: <id />, whereas they had <id>0</id> in the previous dump version (tested on zhwikisource and frwikisource).
Quantification:

2019-05 zhwikisource empty ids: 10311
2019-04 zhwikisource empty ids: 0
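
For anyone wanting to reproduce a count like the one above, here is a minimal Python sketch. It is not the script used for this task (the task does not say how the numbers were obtained); the function names and the inline sample are made up for illustration. It streams the XML and counts <contributor> elements whose <id> child is empty.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so this works with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def count_empty_contributor_ids(stream) -> int:
    """Count <contributor> elements whose <id> child has no text."""
    empty = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "contributor":
            continue
        for child in elem:
            if local_name(child.tag) == "id" and not (child.text or "").strip():
                empty += 1
        elem.clear()  # drop the children of the contributor we just handled
    return empty


# Hypothetical, heavily shortened sample modelled on the excerpts in this task.
SAMPLE = b"""<mediawiki>
  <page>
    <revision>
      <id>193892</id>
      <contributor><username>Wmrwiki</username><id>2632</id></contributor>
    </revision>
    <revision>
      <id>209484</id>
      <contributor><username /><id /></contributor>
    </revision>
  </page>
</mediawiki>"""

print(count_empty_contributor_ids(io.BytesIO(SAMPLE)))  # 1
```

For a real dump the stream could come from gzip.open(path, "rb") instead of the in-memory sample.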

Example of revision with empty id from 2019-05 zhwikisource:

<revision>
  <id>209484</id>
  <timestamp>2001-01-15T00:00:00Z</timestamp>
  <contributor>
    <username />
    <id />
  </contributor>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text xml:space="preserve">{{copy|http://open-lit.com}}</text>
  <sha1>91nyffwsl71ubxm0dji7u1bljr5d8fl</sha1>
</revision>

Same revision on 2019-04 zhwikisource:

<revision>
   <id>209484</id>
   <timestamp>2001-01-15T00:00:00Z</timestamp>
   <contributor>
     <username />
     <id>0</id>
   </contributor>
   <model>wikitext</model>
   <format>text/x-wiki</format>
   <text xml:space="preserve">{{copy|http://open-lit.com}}</text>
   <sha1>91nyffwsl71ubxm0dji7u1bljr5d8fl</sha1>
 </revision>

Details

	Subject	Repo	Branch	Lines +/-
	make sure revision uids are 0 in the xml if missing/0 in the db	mediawiki/core	master	+2 -1


Related Objects

Mentioned In: T228883: mediawiki-history-wikitext-coord job fails every month
Mentioned Here: T221339: Missing index on revision_userindex.rev_actor

Event Timeline

JAllemandou created this task. May 23 2019, 1:09 PM

Restricted Application added subscribers: Liuxinyu970226, Stang. May 23 2019, 1:09 PM

Filling some more details.

The same is visible in the stubs file. Here's the relevant excerpt from zhwikisource-20190420-stub-meta-history.xml.gz:

  <page>
    <title>西遊記</title>
    <ns>0</ns>
    <id>305</id>
...
    <revision>
      <id>193892</id>
      <parentid>168355</parentid>
      <timestamp>2009-02-13T04:19:35Z</timestamp>
      <contributor>
        <username>Wmrwiki</username>
        <id>2632</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="176668" bytes="8134" />
      <sha1>353jnwdvy9hx51y8sx2wh81calrs2pe</sha1>
    </revision>
    <revision>
      <id>209484</id>
      <timestamp>2001-01-15T00:00:00Z</timestamp>
      <contributor>
        <username />
        <id />
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text id="188168" bytes="28" />
      <sha1>91nyffwsl71ubxm0dji7u1bljr5d8fl</sha1>
    </revision>
    <revision>
      <id>209485</id>
      <parentid>193892</parentid>
      <timestamp>2009-03-25T12:57:56Z</timestamp>
      <contributor>
        <username>Wmrwiki</username>
        <id>2632</id>
      </contributor>
...

Note the bogus timestamp as well.

From zhwikisource-20190401-stub-meta-history.xml.gz:

<revision>
  <id>209484</id>
  <timestamp>2001-01-15T00:00:00Z</timestamp>
  <contributor>
    <username />
    <id>0</id>
  </contributor>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text id="188168" bytes="28" />
  <sha1>91nyffwsl71ubxm0dji7u1bljr5d8fl</sha1>
</revision>

The same bogus timestamp but the id field is rendered as 0.

The entry in the database, for completeness' sake:

wikiadmin@10.64.16.191(zhwikisource)> select * from revision where rev_id = 209484;
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
| rev_id | rev_page | rev_text_id | rev_comment | rev_user | rev_user_text | rev_timestamp  | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1                        | rev_content_model | rev_content_format |
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
| 209484 |      305 |      188168 |             |        0 |               | 20010115000000 |              0 |           0 |      28 |             0 | 91nyffwsl71ubxm0dji7u1bljr5d8fl | NULL              | NULL               |
+--------+----------+-------------+-------------+----------+---------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
1 row in set (0.00 sec)

rev_user is clearly 0 there.

Now a look at the actor tables:

wikiadmin@10.64.16.191(zhwikisource)> select * from revision_actor_temp where revactor_rev = 209484;
+--------------+----------------+--------------------+---------------+
| revactor_rev | revactor_actor | revactor_timestamp | revactor_page |
+--------------+----------------+--------------------+---------------+
|       209484 |          72325 | 20010115000000     |           305 |
+--------------+----------------+--------------------+---------------+
1 row in set (0.00 sec)

wikiadmin@10.64.16.191(zhwikisource)> select * from actor where actor_id = 72325;
+----------+------------+------------+
| actor_id | actor_user | actor_name |
+----------+------------+------------+
|    72325 |       NULL |            |
+----------+------------+------------+
1 row in set (0.00 sec)

Well that's a bit inconvenient.
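
To make the difference between the two lookups concrete, here is a small Python sketch using an in-memory SQLite database as a stand-in for the real MariaDB tables. The schema is cut down to the columns shown above, the rows are the ones from this task, and the join is written by hand; it is not the query MediaWiki actually builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (rev_id INTEGER, rev_user INTEGER, rev_user_text TEXT);
    CREATE TABLE revision_actor_temp (revactor_rev INTEGER, revactor_actor INTEGER);
    CREATE TABLE actor (actor_id INTEGER, actor_user INTEGER, actor_name TEXT);
    INSERT INTO revision VALUES (209484, 0, '');
    INSERT INTO revision_actor_temp VALUES (209484, 72325);
    INSERT INTO actor VALUES (72325, NULL, '');
""")

# Old-style lookup: the user id comes straight from the revision row.
old_user = conn.execute(
    "SELECT rev_user FROM revision WHERE rev_id = 209484"
).fetchone()[0]

# Actor-style lookup: the user id comes from the joined actor row.
new_user = conn.execute("""
    SELECT actor_user
    FROM revision_actor_temp
    JOIN actor ON actor_id = revactor_actor
    WHERE revactor_rev = 209484
""").fetchone()[0]

print(old_user)  # 0
print(new_user)  # None, i.e. SQL NULL
```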

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. May 23 2019, 2:04 PM

Both files were generated with

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="zh">

Time of the Apr 20th file:

-rw-r--r-- 1 dumpsgen dumpsgen       89 Apr 21 00:48 sha1sums-zhwikisource-20190420-stub-meta-history.xml.gz.txt

So it's most likely a config change that did us in.

Here it is, merged Apr 15: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/504011/

This means that we get the value out of the actor table indeed, see getJoin(), line 153 here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/ActorMigration.php

Since that's null, XmlDumpWriter::writeContributor ends up calling Xml::element( 'id', null, strval( $id ) ) with a null $id, so the contents argument is an empty string.
Here's the relevant part of Xml::element:

if ( is_null( $contents ) ) {
        $out .= '>';
} elseif ( $allowShortTag && $contents === '' ) {
        $out .= ' />';
} else {
        $out .= '>' . htmlspecialchars( $contents, ENT_NOQUOTES ) . "</$element>";
}

Since we now get here with an empty string instead of 0 for the id, we hit the short-tag branch and get <id /> instead of <id>0</id>.
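
The branch logic is easy to try out in isolation. Below is a rough Python model of the excerpt above; it is not MediaWiki code. xml_element and php_strval are hypothetical stand-ins, attribute handling is left out, and php_strval only covers the int and null cases that matter here.

```python
from html import escape
from typing import Optional


def xml_element(element: str, contents: Optional[str] = "",
                allow_short_tag: bool = True) -> str:
    """Model of the three branches shown in the Xml::element excerpt."""
    out = "<" + element
    if contents is None:
        out += ">"  # open tag only, the caller closes it
    elif allow_short_tag and contents == "":
        out += " />"  # empty contents give the short tag
    else:
        # escape(..., quote=False) escapes &, < and > like ENT_NOQUOTES
        out += ">" + escape(contents, quote=False) + "</" + element + ">"
    return out


def php_strval(value) -> str:
    """Stand-in for PHP strval(), only meant for int and None inputs."""
    return "" if value is None else str(value)


print(xml_element("id", php_strval(0)))     # <id>0</id>
print(xml_element("id", php_strval(None)))  # <id />
```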

Pinging @Anomie to see what the cleverest way out of this issue might be.

In T224221#5209834, @ArielGlenn wrote:

Pinging @Anomie to see what the cleverest way out of this issue might be.

Xml::element( 'id', null, strval( $id ?: 0 ) ) seems most straightforward if you want to continue to have <id>0</id> in the XML when there's no user ID.

Or you could adjust the callers similarly (like $this->writeContributor( $row->xx_user ?: 0, $row->xx_user_text )), or find where the query is built and adjust the SQL to do COALESCE( xx_user, 0 ). We can't have ActorMigration itself do the COALESCE() though, since that will cause various other queries to be unable to properly use indexes (similar to the situation from T221339#5191177, where a WMCS view had a similar coalesce).

In T224221#5210554, @Anomie wrote:

...

Xml::element( 'id', null, strval( $id ?: 0 ) ) seems most straightforward if you want to continue to have <id>0</id> in the XML when there's no user ID.

Or you could adjust the callers similarly (like $this->writeContributor( $row->xx_user ?: 0, $row->xx_user_text )), or find where the query is built and adjust the SQL to do COALESCE( xx_user, 0 ). We can't have ActorMigration itself do the COALESCE() though, since that will cause various other queries to be unable to properly use indexes (similar to the situation from T221339#5191177, where a WMCS view had a similar coalesce).

I think fixing up writeContributor is the safest bet; it changes only that specific behavior without other side effects, without impacting the query itself. Thanks!
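
As a sanity check on what strval( $id ?: 0 ) does for the values that can show up here, this is a small Python model of it. The function name is made up, and the explicit list of falsy values is there because PHP and Python disagree on whether the string '0' is falsy.

```python
def contributor_id_text(raw_id) -> str:
    """Model of strval( $id ?: 0 ) for the values a user id can hold."""
    # PHP treats null, 0, '' and '0' as falsy, so ?: replaces them with 0.
    # Python treats '0' as truthy, hence the explicit membership test.
    if raw_id in (None, 0, "", "0"):
        return "0"
    return str(raw_id)


for value in (None, 0, "", "0", 2632, "2632"):
    print(repr(value), "->", contributor_id_text(value))
```

Every missing or zero id ends up as "0", and real ids pass through unchanged.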

Change 512672 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] make sure revision uids are 0 in the xml if missing/0 in the db

https://gerrit.wikimedia.org/r/512672

gerritbot added a project: Patch-For-Review. May 27 2019, 1:05 PM

I have tested the above patch with the full stubs dump from zhwikisource and it writes out <id>0</id> everywhere that the last run had <id /> without changing anything else, so it's ready for review/merge.
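
The comment above does not say how the two runs were compared. One way to check the contributor ids specifically is sketched below in Python; the function names and the tiny inline samples are hypothetical, and the check covers only contributor ids, not the rest of the revision metadata.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def contributor_ids(stream):
    """Map revision id -> contributor <id> text ('' if the element is empty)."""
    result = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "revision":
            continue
        rev_id, contrib_id = None, None
        for child in elem:
            name = local_name(child.tag)
            if name == "id":
                rev_id = child.text
            elif name == "contributor":
                for sub in child:
                    if local_name(sub.tag) == "id":
                        contrib_id = (sub.text or "").strip()
        if rev_id is not None:
            result[rev_id] = contrib_id
        elem.clear()  # keep memory flat on large dumps
    return result


def unexpected_differences(before, after):
    """Revision ids whose contributor id did anything other than '' -> '0'."""
    bad = []
    for rev_id in sorted(set(before) | set(after)):
        old, new = before.get(rev_id), after.get(rev_id)
        if old == new and old != "":
            continue  # unchanged and not empty
        if old == "" and new == "0":
            continue  # the intended fix
        bad.append(rev_id)
    return bad


BEFORE = b"""<page>
  <revision><id>193892</id><contributor><username>Wmrwiki</username><id>2632</id></contributor></revision>
  <revision><id>209484</id><contributor><username /><id /></contributor></revision>
</page>"""
AFTER = BEFORE.replace(b"<id />", b"<id>0</id>")

before = contributor_ids(io.BytesIO(BEFORE))
after = contributor_ids(io.BytesIO(AFTER))
print(unexpected_differences(before, after))  # []
```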

So, the root cause is that non-existing users are represented by actor_user = NULL instead of actor_user = 0?
NULL is semantically nicer, but I fear this inconsistency is going to bite us in a lot of places now... Any code that is migrated from using rev_user to using actor will have to know about this special case...

I don't know what else in MW relies on having a 0 uid instead of NULL, but the dumps certainly do.

Change 512672 merged by jenkins-bot:
[mediawiki/core@master] make sure revision uids are 0 in the xml if missing/0 in the db

https://gerrit.wikimedia.org/r/512672

ReleaseTaggerBot added a project: MW-1.34-notes (1.34.0-wmf.8; 2019-06-04). Jun 3 2019, 6:00 PM

Maintenance_bot removed a project: Patch-For-Review. Jun 3 2019, 6:10 PM

The current run will still have the uid issue with the stubs. However, if all goes well with the train this week, the next run on the 20th should have the issue fixed.

ArielGlenn moved this task from Active to Blocked/Stalled/Waiting for event on the Dumps-Generation board. Jun 10 2019, 5:22 AM

The current run shows no bad uids for zhwikisource stubs so this is done.

ArielGlenn moved this task from Blocked/Stalled/Waiting for event to Done on the Dumps-Generation board. Jun 27 2019, 7:49 AM

Nuria mentioned this in T228883: mediawiki-history-wikitext-coord job fails every month. Jul 24 2019, 3:10 PM

Aklapper removed a subscriber: Anomie. Oct 16 2020, 5:38 PM

Stang unsubscribed. Nov 13 2021, 11:23 PM
