
Wikidata skipped dispatching some changes during the eqiad->codfw switch
Closed, Resolved (Public)

Description

SAL
13:55 ori: [switchover #1]: disabling eqiad jobrunners via "salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'service jobrunner stop; service jobchron stop;'".
13:56 _joe_: [switchover #3] disabling cronjobs on terbium
15:02 _joe_: [switchover #13] starting maintenace jobs
15:10 logmsgbot: ori@tin Synchronized wmf-config/ProductionServices.php: live-hack fix for rdb2*.eqiad (duration: 00m 34s)

During the switch there were periods when the maintenance scripts (in our case, the dispatcher) were running but the job queues could not be written to,
i.e. between 13:55 and 13:56, and again between 15:02 and 15:10 (the whole window being 13:55 to 15:10).

As far as I can tell, in the case of Wikidata this causes the dispatchers to run and try to queue jobs, but the jobs do not get queued.
Looking at the code and speaking to @daniel, this should not cause changes to get missed.

However, the following change was not dispatched to fiwiki:

select * from wb_changes where change_revision_id = 323124588 limit 1;
+-----------+----------------------+----------------+------------------+--------------------+----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| change_id | change_type          | change_time    | change_object_id | change_revision_id | change_user_id | change_info
+-----------+----------------------+----------------+------------------+--------------------+----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 321229424 | wikibase-item~update | 20160419145156 | q1926802         |          323124588 |        1896387 | {"diff":{"type":"diff\/item","isassoc":true,"operations":{"links":{"type":"diff","isassoc":true,"operations":{"fiwiki":{"type":"diff","isassoc":true,"operations":{"name":{"type":"add","newvalue":"Michael Valgren"}}}}},"claim":{"type":"diff","isassoc":true,"operations":[]}}},"metadata":{"user_text":"Quinn","page_id":1856525,"parent_id":319636412,"comment":"\/* wbsetsitelink-add:1|fiwiki *\/ Michael Valgren","rev_id":323124588}} |
+-----------+----------------------+----------------+------------------+--------------------+----------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.03 sec)

It stayed undispatched until I manually poked the item: https://www.wikidata.org/w/index.php?title=Q1926802&action=history

Another example, which I have not fiddled with yet, is a few rows down in the changes table.

https://www.wikidata.org/w/index.php?title=Q19933855&oldid=323124612
The link change was not dispatched to https://de.wikipedia.org/wiki/Richard_Hamann_(Sanit%C3%A4tsoffizier), so there are no links in the sidebar.

| 321229448 | wikibase-item~update | 20160419145212 | q19933855        |          323124612 |        1895734 | {"diff":{"type":"diff\/item","isassoc":true,"operations":{"links":{"type":"diff","isassoc":true,"operations":{"dewiki":{"type":"diff","isassoc":true,"operations":{"name":{"type":"add","newvalue":"Richard Hamann (Sanit\u00e4tsoffizier)"}}}}},"claim":{"type":"diff","isassoc":true,"operations":[]}}},"metadata":{"user_text":"M2k~dewiki","page_id":21575717,"parent_id":316407931,"comment":"\/* wbsetsitelink-add:1|dewiki *\/ Richard Hamann (Sanit\u00e4tsoffizier)","rev_id":323124612}}

Per something @daniel just said, we could always take the change rows between two timestamps out of the table and add them again (causing them to be re-dispatched).
I guess we should avoid doing this for items that have since had other changes dispatched?
I feel a maintenance script is needed here :)
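A rough sketch of what such a script could do, written against an in-memory SQLite table rather than the production database. The function name `requeue_changes`, the helper `create_schema`, and the rule "skip an item if it has any change after the window" are assumptions made for illustration; this is not the patch proposed in Gerrit.

```python
import sqlite3

COLUMNS = ("change_type, change_time, change_object_id, "
           "change_revision_id, change_user_id, change_info")


def create_schema(conn):
    """Create a simplified stand-in for wb_changes (illustration only)."""
    conn.execute(
        """
        CREATE TABLE wb_changes (
            change_id INTEGER PRIMARY KEY AUTOINCREMENT,
            change_type TEXT, change_time TEXT, change_object_id TEXT,
            change_revision_id INTEGER, change_user_id INTEGER,
            change_info TEXT
        )
        """
    )


def requeue_changes(conn, start, end):
    """Re-insert rows with change_time in [start, end] under fresh change_ids.

    Timestamps use the YYYYMMDDHHMMSS string form, so they compare
    correctly as text. Items with a change after `end` are skipped, on
    the assumption that the later change has already been dispatched.
    Returns the number of rows that were re-queued.
    """
    cur = conn.cursor()
    rows = cur.execute(
        f"""
        SELECT c.change_id, {', '.join('c.' + n.strip() for n in COLUMNS.split(','))}
        FROM wb_changes AS c
        WHERE c.change_time BETWEEN ? AND ?
          AND NOT EXISTS (
              SELECT 1 FROM wb_changes AS later
              WHERE later.change_object_id = c.change_object_id
                AND later.change_time > ?
          )
        ORDER BY c.change_id
        """,
        (start, end, end),
    ).fetchall()
    for row in rows:
        # Remove the old row, then add it back so it receives a new,
        # higher change_id that lies beyond every client's pointer.
        cur.execute("DELETE FROM wb_changes WHERE change_id = ?", (row[0],))
        cur.execute(
            f"INSERT INTO wb_changes ({COLUMNS}) VALUES (?, ?, ?, ?, ?, ?)",
            row[1:],
        )
    conn.commit()
    return len(rows)
```

For the switchover above, the call would be something like `requeue_changes(conn, "20160419135500", "20160419151000")`. The sketch keeps the original `change_time` on the re-inserted row; whether a real script should do that is an open question.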

Event Timeline

For reference: the ChangeDispatcher (using ChangeDispatchCoordinator) keeps track of which changes have been dispatched to which client wiki in the wb_changes_dispatch table. The chd_seen field records up to which change_id changes have been dispatched to ("seen" by) each client wiki. When ChangeDispatcher fails to queue jobs for a given client wiki, the chd_seen pointer should not be updated, so that the batch is retried later. Apparently, this does not quite work as expected. Perhaps some optimization based on chd_touched is getting in the way.

The critical code is in ChangeDispatcher::dispatchTo; the chd_seen is set by $wikiState['chd_seen'] = $continueAfter.
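To make the intended behaviour concrete, here is a minimal sketch of a dispatch step that only advances the pointer after the job has been queued. It is not the Wikibase implementation: `dispatch_to`, `QueueUnavailable`, and the `push_job` callback are hypothetical names, and the real code is PHP.

```python
class QueueUnavailable(Exception):
    """Raised by the hypothetical job queue when a write is rejected."""


def dispatch_to(wiki_state, pending_change_ids, push_job, batch_size=100):
    """Dispatch one batch of changes to a client wiki.

    wiki_state is a dict with a 'chd_seen' key; push_job is a callable
    that queues a job for the given list of change ids and raises
    QueueUnavailable on failure. Returns the number of changes dispatched.
    """
    after = wiki_state["chd_seen"]
    # Only changes beyond the pointer are candidates for this run.
    batch = sorted(cid for cid in pending_change_ids if cid > after)[:batch_size]
    if not batch:
        return 0
    try:
        push_job(batch)
    except QueueUnavailable:
        # Leave chd_seen untouched so the same batch is retried next run.
        return 0
    # Advance the pointer only once the job is safely queued.
    wiki_state["chd_seen"] = batch[-1]
    return len(batch)
```

If the pointer were updated before, or regardless of, the outcome of `push_job`, a failed write would silently skip the batch, which matches the symptom reported in this task.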

Change 284440 had a related patch set uploaded (by Addshore):
Add requeueChanges maint script

https://gerrit.wikimedia.org/r/284440

Addshore renamed this task from Wikidata skipped dispatching some changes during the wqiad->codfw switch to Wikidata skipped dispatching some changes during the eqiad->codfw switch. Apr 21 2016, 8:20 AM

Wikidata team, what are the actionables here? How can we make datacenter switchovers smoother for Wikidata?

Adding for next story time to find the actionables.

While T133144#2222515 has an idea of what went wrong, another possibility is a bug that kept dispatching from noticing that it could not create jobs.
From "job queues were not able to be written to" I assume the EnqueueJob strategy with two job queues was not implemented, and that instead there was a single job queue where inserting a job failed because the Redis server was a slave in read-only mode (i.e. the job Redis servers were switched while the crons had not been switched yet). If that is what happened, it can be tested in a MediaWiki-Vagrant instance (it comes with the job queue using Redis).
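The hypothesis can be illustrated with a toy model. Everything here is assumed for the sake of the example: `ReadOnlyQueue` is a stand-in for a queue backed by a read-only Redis server, and it reports failure through a return value. How the failure actually surfaces in production is not established in this task.

```python
class ReadOnlyQueue:
    """Hypothetical job queue that rejects writes while read-only."""

    def __init__(self, read_only):
        self.read_only = read_only
        self.jobs = []

    def push(self, job):
        """Return True if the job was stored, False if the write was rejected."""
        if self.read_only:
            return False
        self.jobs.append(job)
        return True


def dispatch_unchecked(state, change_ids, queue):
    """Suspected bug: the push result is ignored, so the pointer always moves."""
    queue.push(list(change_ids))
    state["chd_seen"] = max(change_ids)


def dispatch_checked(state, change_ids, queue):
    """Expected behaviour: the pointer moves only if the push succeeded."""
    if not queue.push(list(change_ids)):
        return False
    state["chd_seen"] = max(change_ids)
    return True
```

With a read-only queue, `dispatch_unchecked` leaves the queue empty but still advances `chd_seen`, so those changes would never be retried; `dispatch_checked` leaves the pointer where it was.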

While moving away from dispatching via cron to a delayed job (T48643) might lessen the symptoms, it is probably worth tracking down this bug even if we switch to that.

Change 284440 abandoned by Addshore:
Add requeueChanges maint script

https://gerrit.wikimedia.org/r/284440