Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance
Closed, Resolved · Public

Description

Hi all,

as part of https://phabricator.wikimedia.org/T123675 I was trying to isolate rdb1003 from the JobQueues pool to re-image it with Debian Jessie.

Code review for the change: https://gerrit.wikimedia.org/r/#/c/274411/ (followed by an immediate revert)

The idea was to remove rdb1003 completely from wmf-config without failing over to its slave, to avoid running a shard without its backup. The next steps (as described in the phab task) would have been to let the jobrunners drain the job queues and eventually remove rdb1003 from hiera too (jobrunners config).
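
For context, jobqueue-eqiad.php configures the federated job queue (JobQueueFederated) with two related arrays: 'partitionsBySection', which says which partition slots each section pushes to and with what weight, and 'configByPartition', which holds the Redis connection details for each slot. A rough sketch of that shape (not the actual production file: the hostnames, weights and JobQueueRedis options below are illustrative assumptions; only the two array keys and the rdbN-6379 slot naming appear in this task):

```php
// Sketch only: illustrative values, not the production wmf-config.
$wgJobQueueConf['default'] = [
    'class' => 'JobQueueFederated',
    // Which partition slots a section pushes to, and with what weight.
    'partitionsBySection' => [
        'default' => [ 'rdb1-6379' => 40, 'rdb2-6379' => 60 ],
    ],
    // Redis connection details for each partition slot.
    'configByPartition' => [
        'rdb1-6379' => [
            'class' => 'JobQueueRedis',
            'redisServer' => 'rdb1001.eqiad.wmnet:6379',
            'redisConfig' => [ /* auth, timeouts, ... */ ],
        ],
        'rdb2-6379' => [
            'class' => 'JobQueueRedis',
            'redisServer' => 'rdb1003.eqiad.wmnet:6379',
            'redisConfig' => [ /* auth, timeouts, ... */ ],
        ],
    ],
];
```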

The plan failed miserably. Right after the sync on tin I observed tons of NOTICE events in the hhvm Logstash dashboard, some of them like:

Notice: Undefined index: 0 in /srv/mediawiki/php-1.27.0-wmf.14/includes/search/SearchExactMatchRescorer.php on line 44

Notice: JobQueueGroup::__destruct: 1 buffered job(s) of type(s) EnqueueJob never inserted. in /srv/mediawiki/php-1.27.0-wmf.14/includes/jobqueue/JobQueueGroup.php on line 421

More info in https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm

From: https://wikitech.wikimedia.org/wiki/Server_Admin_Log (UTC):

12:07 logmsgbot: elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Remove rdb1003 from the Redis JobQueue pool for maintenance (duration: 00m 32s)

12:12 logmsgbot: elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Revert - Remove rdb1003 from the Redis JobQueue pool for maintenance (duration: 00m 28s)

Another event triggered by this issue:

12:11  <icinga-wm> PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1138 bytes in 0.056 second response time

12:16  <icinga-wm> RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1657 bytes in 0.152 second response time

Of course I am terribly sorry to have caused this incident. I am going to pause this task until we have a clear idea of how to de-pool/re-pool rdb1003 safely (this task is a blocker for the codfw switchover, though).

Thanks!

Luca

Event Timeline

elukey created this task. Mar 3 2016, 12:40 PM
Restricted Application added a subscriber: Aklapper. Mar 3 2016, 12:40 PM
elukey triaged this task as High priority. Mar 3 2016, 12:41 PM
elukey updated the task description.
aaron added a comment. Mar 4 2016, 7:35 PM

How would the queues on 1003 drain if it's depooled? Wouldn't they just be left as they were before?

aaron added a comment. Mar 4 2016, 7:48 PM

The SearchExactMatchRescorer seems unrelated, especially given all the random spikes at https://logstash.wikimedia.org/#dashboard/temp/AVNDLJInO3D718AOklso

elukey added a comment. Mar 7 2016, 9:09 AM

Hi Aaron!

> How would the queues on 1003 drain if it's depooled? Wouldn't they just be left as they were before?

AFAIK removing 1003 from wmf-config/jobqueue-eqiad.php would only stop MediaWiki from enqueueing new jobs there, while the jobrunner configuration (in particular the hosts from which they retrieve the jobs to execute) lives in puppet (hieradata/eqiad/mediawiki/jobrunner.yaml).

Joe added a comment. Mar 7 2016, 9:13 AM

To better frame the issue:

We need a reliable method to depool one rdb host. The sequence should be something like:

  1. MediaWiki stops writing messages to server X
  2. The jobrunners consume all the jobs from server X, until there is none left
  3. We do the needed maintenance
  4. We repool server X and jobs are both enqueued and consumed

At this point it's not clear to me which code/config changes can achieve this.

@aaron could you suggest a way to safely do all of the above?

elukey added a comment. Mar 7 2016, 9:19 AM

Could it be that running sync-file on wmf-config/jobqueue-eqiad.php is not enough to clear the following configuration?

https://github.com/wikimedia/mediawiki/blob/wmf/1.27.0-wmf.15/includes/jobqueue/JobQueueGroup.php#L355

aaron added a comment. Mar 7 2016, 9:35 AM

During the buffered job warning flood there was also a related flood of "No configuration for partition 'rdb2-6379'" exceptions in similar volume.

That patch should have changed 'partitionsBySection' rather than 'configByPartition'.
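
Concretely, a sketch of the difference (using the same assumed layout as the sketch in the description; this is not the actual follow-up patch):

```php
// What the failed change effectively did (sketch): the slot's connection info
// disappeared while 'partitionsBySection' still referenced it, so
// JobQueueFederated could not resolve 'rdb2-6379' ("No configuration for
// partition ...") and buffered EnqueueJobs were never inserted.
//
// The shape suggested above instead: stop pushing to the slot by dropping it
// from 'partitionsBySection', while keeping its entry in 'configByPartition'
// so anything still referencing the slot can resolve it.
$wgJobQueueConf['default'] = [
    'class' => 'JobQueueFederated',
    'partitionsBySection' => [
        // 'rdb2-6379' dropped here: MediaWiki stops pushing to rdb1003
        'default' => [ 'rdb1-6379' => 40 ],
    ],
    'configByPartition' => [
        'rdb1-6379' => [ /* unchanged */ ],
        'rdb2-6379' => [ /* kept while the host is de-pooled */ ],
    ],
];
```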

Change 275585 had a related patch set uploaded (by Elukey):
Remove rdb1003.eqiad from the Redis Job Queues for maintenance.

https://gerrit.wikimedia.org/r/275585

elukey added a comment. Mar 7 2016, 6:24 PM

@aaron: I didn't see the partition configuration in the file and I probably should have double-checked. I am planning to add documentation to https://wikitech.wikimedia.org/wiki/Service_restarts with precise instructions to prevent this issue from happening again.

Does https://gerrit.wikimedia.org/r/#/c/275585/1 look sound to you?

aaron added a comment. Mar 7 2016, 9:44 PM

> To better frame the issue:
> We need a reliable method to depool one rdb host. The sequence should be something like:
>
>   1. MediaWiki stops writing messages to server X
>   2. The jobrunners consume all the jobs from server X, until there is none left
>   3. We do the needed maintenance
>   4. We repool server X and jobs are both enqueued and consumed
>
> At this point it's not clear to me which code/config changes can achieve this.
> @aaron could you suggest a way to safely do all of the above?

Why not either:
a) Just depool it totally (leaving the jobs there), re-image, restore the data from the slave if it was wiped too, and re-pool
b) If that will take a very long time, then set a spare server to slave the slave and promote the old slave to master, de-pooling the old master and re-imaging, then switching back afterwards (unless the server is decommed or something)

The slots (e.g. rdbY-X) in the configuration are not normally meant to be removed completely, just briefly disabled/ignored. New servers can totally take the place of old servers in a slot via slaving/promotion, but removing a slot completely for a long time is tricky to do atm.

It wouldn't be *too* much work to add code and a script to move jobs around partition slots with JobQueueFederated. I'd prefer this over the "drain" approach since the drain time depends on the backlog time of the worst queue, which could be minutes or even a week (e.g. a backlog of transcode jobs or something). Moving jobs off of inactive partition slots doesn't have this limitation.
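
No such script existed at the time; purely to illustrate the idea, here is a minimal sketch for a single wiki and job type. JobQueue::factory(), pop(), push() and ack() are the generic JobQueue methods; the hostnames, wiki and job type are placeholders:

```php
// Sketch: move all jobs of one type for one wiki from a de-pooled partition
// slot to a pooled one. A real script would iterate over wikis and job types
// and handle delayed/abandoned jobs; this only shows the core loop.
$wiki = 'enwiki';          // placeholder
$type = 'refreshLinks';    // placeholder

// Source: the slot being drained (assumed to be backed by rdb1003).
$src = JobQueue::factory( [
    'class'       => 'JobQueueRedis',
    'wiki'        => $wiki,
    'type'        => $type,
    'redisServer' => 'rdb1003.eqiad.wmnet:6379',
    'redisConfig' => [],
    'daemonized'  => true,
] );

// Destination: a slot that stays pooled (assumed to be backed by rdb1001).
$dst = JobQueue::factory( [
    'class'       => 'JobQueueRedis',
    'wiki'        => $wiki,
    'type'        => $type,
    'redisServer' => 'rdb1001.eqiad.wmnet:6379',
    'redisConfig' => [],
    'daemonized'  => true,
] );

// Re-enqueue each claimed job on the destination, then ack it on the source.
while ( ( $job = $src->pop() ) !== false ) {
    $dst->push( $job );
    $src->ack( $job );
}
```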

elukey added a comment. Mar 8 2016, 4:13 PM

Hi @aaron, adding some comments:

> Why not either:
> a) Just depool it totally (leaving the jobs there), re-image, restore the data from the slave if it was wiped too, and re-pool

So the procedure could be the following (taking rdb1003 as master and rdb1004 as slave):

  1. Send a code review to remove rdb1003 (like https://gerrit.wikimedia.org/r/#/c/275585/1) and deploy it. Remove rdb1003 from the Job Runners config in hiera.
  2. Run SLAVEOF NO ONE on rdb1004 to turn it into a master and stop the replication.
  3. Re-image rdb1003.
  4. Once rdb1003 is up and running, run SLAVEOF rdb1004 on it to get it back in sync (see the replication check sketched after this list).
  5. Once the replication has caught up, run SLAVEOF NO ONE on rdb1003.
  6. Run SLAVEOF rdb1003 on rdb1004.
  7. Revert the code review and add rdb1003 back into the various configs.
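
For step 4, a small sanity check (sketched with the phpredis extension) that one could run before the role swap in steps 5 and 6, to confirm the freshly re-imaged rdb1003 has caught up with rdb1004. The hostname, port and omitted AUTH password are assumptions:

```php
// Sketch only: verify replication health on rdb1003 before promoting it.
$replica = new Redis();
$replica->connect( 'rdb1003.eqiad.wmnet', 6379 );
// $replica->auth( $password ); // needed if requirepass is configured

$info = $replica->info( 'replication' );
if ( $info['role'] === 'slave'
    && $info['master_link_status'] === 'up'
    && (int)$info['master_last_io_seconds_ago'] <= 1
) {
    // Link is healthy and fresh: step 5 (promote rdb1003 back to master).
    $replica->slaveof(); // no arguments is equivalent to SLAVEOF NO ONE
    // Step 6 would then be run against rdb1004:
    // $old->slaveof( 'rdb1003.eqiad.wmnet', 6379 );
}
```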

> The slots (e.g. rdbY-X) in the configuration are not normally meant to be removed completely, just briefly disabled/ignored. New servers can totally take the place of old servers in a slot via slaving/promotion, but removing a slot completely for a long time is tricky to do atm.

Can you add a bit more info? How can this affect the above procedure? (what can go wrong, etc.)

> It wouldn't be *too* much work to add code and a script to move jobs around partition slots with JobQueueFederated. I'd prefer this over the "drain" approach since the drain time depends on the backlog time of the worst queue, which could be minutes or even a week (e.g. a backlog of transcode jobs or something). Moving jobs off of inactive partition slots doesn't have this limitation.

This is really interesting, but I guess we should get a better understanding of the timings first. Thanks for the tip about the long-running jobs; that sounds like a big blocker for the drain approach.

Thanks!

Luca

aaron added a comment. Mar 8 2016, 7:33 PM

That procedure makes sense.

As far as the slots go, having one disabled just means all the jobs in it get delayed for that long, which can cause problems if hours go by. Removing a slot entirely requires either draining it (there is no config option to support popping without pushing) or moving jobs around (there is no easy script to do this atm).

Change 275585 abandoned by Elukey:
Remove rdb1003.eqiad from the Redis Job Queues for maintenance.

Reason:
Followed another procedure with _joe_

https://gerrit.wikimedia.org/r/275585

elukey added a comment. Mar 9 2016, 5:49 PM

@aaron: today I reimaged rdb1003 using the following procedure:

https://etherpad.wikimedia.org/p/redis-rdb-reimaging

Everything went fine, but as you said the downside of this approach is that a lot of jobs get delayed, causing some impact. If you are ok with it, I am planning to move forward with rdb1001 and rdb1005 tomorrow, in order to unblock https://phabricator.wikimedia.org/T124673.

Let me know!

Luca

elukey closed this task as Resolved. Mar 14 2016, 11:05 AM