
deployment-prep has jobqueue issues
Closed, Resolved, Public

Description

Millions (literally) of error messages like Notice: Undefined index: channels in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 340 since about 2018-04-17/18 midnight UTC, which is apparently breaking T192471 and another global rename I was asked to perform today. Thanks.

https://logstash-beta.wmflabs.org/goto/d4ac7b432beed9503464aa12e97930e1

Event Timeline

Also: Warning: Invalid argument supplied for foreach() in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 343

MarcoAurelio updated the task description.
MarcoAurelio added subscribers: Rxy, hoo.

@hoo and @Rxy suggest this may be nutcracker refusing to stay up and running. Indeed, when I run sudo puppet agent -tv it stays up for a minute or two, then goes down again.

hoo@deployment-jobrunner03:~$ sudo tail /var/log/nutcracker/nutcracker.log
[2018-04-18 21:18:53.606] nc_proxy.c:148 bind on p 11 to addr '/var/run/nutcracker/nutcracker.sock 0666' failed: No such file or directory
…
MarcoAurelio renamed this task from Notice: Undefined index: channels in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 340 to redis/nutcracker down on deployment-prep. Apr 18 2018, 9:44 PM
MarcoAurelio updated the task description.

nutcracker now starts on deployment-jobrunner03; see my comment on T178457. The error in the task description still happens, though.

That patch changed $wgMainWANCache. Judging by both its timing and its content, I think it is very likely the reason we see tons of Notice: Undefined index: channels in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 340, which in turn is probably the reason the job queue is broken.
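
For context, the notice is consistent with a WAN cache definition that is missing its 'channels' key. A rough sketch of the relevant configuration shape follows; the cache name and values below are made up, this is not the actual beta config:

// Hypothetical excerpt, not the real deployment-prep settings.
// $wgMainWANCache selects an entry from $wgWANObjectCaches.
$wgMainWANCache = 'main-wan';
$wgWANObjectCaches['main-wan'] = [
	'class'    => WANObjectCache::class,
	'cacheId'  => CACHE_MEMCACHED,
	// Without this 'channels' key, ObjectCache::newWANCacheFromParams()
	// first triggers "Undefined index: channels" (line 340) and then
	// "Invalid argument supplied for foreach()" (line 343).
	'channels' => [ 'purge' => 'wancache-main-purge' ],
];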

EddieGP renamed this task from redis/nutcracker down on deployment-prep to deployment-prep has jobqueue/caching issues. Apr 19 2018, 12:17 AM
EddieGP updated the task description.

Change 427590 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Avoid warnings in ObjectCache::newWANCacheFromParams if "channels" is not set"

https://gerrit.wikimedia.org/r/427590

The warnings are pointless; the patch above adds an isset() check.
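
Something along these lines, as a sketch of the kind of guard the patch adds (not the literal diff; see Gerrit change 427590 for the real one):

// Sketch only: skip channel setup entirely when 'channels' is not configured.
$channels = isset( $params['channels'] ) ? $params['channels'] : [];
foreach ( $channels as $action => $channel ) {
	// Set up the purge relayer for this channel as before; with no
	// 'channels' configured the loop simply does nothing.
}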

EddieGP lowered the priority of this task from Unbreak Now! to Low. Apr 19 2018, 8:24 AM

Per Aaron's comment, this is just logspam. It seems the actual problem for the renames was nutcracker, which I fixed in T192473#4141169. Both are now unstuck.

Change 427590 merged by jenkins-bot:
[mediawiki/core@master] Avoid warnings in ObjectCache::newWANCacheFromParams if "channels" is not set"

https://gerrit.wikimedia.org/r/427590

The error mentioned is gone, thanks. However, we still have issues: T192604.

EddieGP raised the priority of this task from Low to High. Apr 19 2018, 11:16 PM

Indeed, the jobqueue on beta is still broken, although it's unrelated to the logspam.
Thanks to Antoine, I now know that jobs are queued in kafka. Fiddling a bit on deployment-kafka04 gives me the impression that all the jobs made it there and are sitting happily in the queue; somehow they just don't get consumed. From the description of profile::cpjobqueue, I gather that deployment-cpjobqueue consumes the messages from kafka and triggers the jobs on deployment-jobrunner03. There are some errors from deployment-cpjobqueue in logstash; I might take a look at those tomorrow, or rather later today in my TZ.

EddieGP renamed this task from deployment-prep has jobqueue/caching issues to deployment-prep has jobqueue issues. Apr 20 2018, 10:00 PM

@EddieGP Puppet on deployment-cpjobqueue was broken due to a full disk; this was fixed today at T193127. However, since March 20 logstash-beta has not been recording any messages, which is being tracked at T193133, and of course global renames and all tasks depending on the job queue are still not running / broken :-(

Who can we approach to have the jobqueue issues resolved, @greg?

Thanks.

Also, there are stalled jobs in the job table:

wikiadmin@deployment-db04[deploymentwiki]> select count(*) from job;
+----------+
| count(*) |
+----------+
|       31 |
+----------+
1 row in set (0.00 sec)
wikiadmin@deployment-db04[deploymentwiki]> select job_id, job_cmd, job_timestamp from job;
+--------+---------------+----------------+
| job_id | job_cmd       | job_timestamp  |
+--------+---------------+----------------+
|      1 | refreshLinks2 | 20121102235108 |
|      2 | enotifNotify  | 20121105052726 |
|      3 | enotifNotify  | 20121121052728 |
|      4 | enotifNotify  | 20121202013726 |
|      5 | refreshLinks2 | 20121227143406 |
|      6 | refreshLinks2 | 20121227143638 |
|      7 | refreshLinks2 | 20121227143816 |
|      8 | refreshLinks2 | 20121227143908 |
|      9 | refreshLinks2 | 20121227143945 |
|     10 | refreshLinks2 | 20121227152308 |
|     11 | refreshLinks2 | 20121227152442 |
|     12 | refreshLinks2 | 20121227152521 |
|     13 | refreshLinks2 | 20121227152551 |
|     14 | refreshLinks2 | 20121227152639 |
|     15 | refreshLinks2 | 20121227152749 |
|     16 | refreshLinks2 | 20121227152849 |
|     17 | refreshLinks2 | 20121227152935 |
|     18 | refreshLinks2 | 20121227152959 |
|     19 | refreshLinks2 | 20121227153406 |
|     20 | refreshLinks2 | 20121227153436 |
|     21 | refreshLinks2 | 20121227154227 |
|     22 | refreshLinks2 | 20121227154257 |
|     23 | refreshLinks2 | 20121227154307 |
|     24 | refreshLinks2 | 20121227154335 |
|     25 | refreshLinks2 | 20121227154401 |
|     26 | refreshLinks2 | 20121227154423 |
|     27 | refreshLinks2 | 20121227154509 |
|     28 | refreshLinks2 | 20121227160211 |
|     29 | refreshLinks2 | 20121227160303 |
|     30 | enotifNotify  | 20130226015129 |
|     31 | enotifNotify  | 20130409222941 |
+--------+---------------+----------------+
31 rows in set (0.00 sec)

I don't think years-old jobs should be there as per https://www.mediawiki.org/wiki/Manual:Job_table

Once upon a time, appservers would insert jobs into a database table and jobrunner servers would read 'em from there.

But then came the great rise of redis (or was there another step in between?): appservers would push jobs into redis storage, and a script called 'jobrunner' would pull them and execute them on servers called jobrunners. When switching away from db-based storage for jobs, some old jobs were apparently just left behind. At this point I can hardly imagine anyone still cares about them; the table might as well be truncated.

Now there is kafka: app servers (like deployment-mediawiki-07) insert jobs into 'topics' there, and another host (deployment-cpjobqueue) fetches the job descriptions from these 'topics' and creates an HTTP request for each of them. That HTTP request triggers the execution of the job on the servers called 'jobrunner' (deployment-jobrunner02), while the script called 'jobrunner' is no longer involved at all.

When you execute runJobs.php or showJobs.php you don't see anything, because those scripts cannot query the queue, neither for jobs nor for the queue size. MediaWiki's kafka storage for the jobqueue is push-only ( https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue ), which means that MediaWiki itself never reads the queue and only executes jobs when commanded to do so by HTTP requests.

Or at least, that is my understanding from the configuration and the descriptions I've seen so far, which means it might as well be totally off. Until recently I didn't even know that we're now using kafka; I assumed it was redis and the 'jobrunner' script, because nothing in [[wikitech:job queue]], [[wikitech:jobrunner]] or [[wikitech:redis]] mentioned it (I've now partially changed that, but edits would be welcome).
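
If that push-only description is right, the MediaWiki side can be pictured as a queue object that can only write, never read. An illustrative PHP sketch follows; this is not the actual Kafka-backed queue class, and all names in it are made up:

// Illustrative only: a push-only queue in the spirit of what is described
// above. MediaWiki writes job events; consumption happens elsewhere
// (cpjobqueue turning queue messages into HTTP calls to the jobrunners).
class PushOnlyJobQueue {
	/** @var callable Sends a serialized job description to an external broker. */
	private $producer;

	public function __construct( callable $producer ) {
		$this->producer = $producer;
	}

	public function push( array $jobDescription ) {
		( $this->producer )( json_encode( $jobDescription ) );
	}

	public function getSize() {
		// This is why showJobs.php has nothing to report: the queue size
		// lives in the external broker, not anywhere MediaWiki itself can query.
		throw new LogicException( 'Queue size is not readable from MediaWiki.' );
	}
}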

refreshLinks2 is not used anymore. Since it is no longer in $wgJobClasses, those rows probably won't get cleared by recycleAndDeleteStaleJobs().

(That is, if JobQueueDB was used to begin with.)
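
For context, $wgJobClasses is the map from the job_cmd value stored in the job table to the Job class that runs it. A short sketch of why the leftover rows never get picked up, assuming the standard enotifNotify mapping:

// $wgJobClasses maps job_cmd names to runnable Job classes, e.g.:
$wgJobClasses['enotifNotify'] = EnotifNotifyJob::class;
// There is no longer any 'refreshLinks2' entry, so no queue object is ever
// created for that job type and nothing (including the cleanup in
// recycleAndDeleteStaleJobs()) ever looks at those rows again.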

I'm not so familiar with the Kafka system (only the basic concept).

@EddieGP @aaron Thanks for the information. Still, the job queue is, as far as I can see, totally broken for all sites; that's not okay, and I'd dare to say it prevents the Beta Cluster from being usable as a production-like environment for testing and developing software. Who can we beg to have a look at this? GlobalRename is broken there, to give an example. Adding RelEng here, as deployment-prep is their environment.

Regarding the job table and its truncation: I wouldn't touch it myself. I'd say we should do the same for all the other deployment-prep wikis, and of course I assume there are replicas of those tables that would need to be cleaned too. I know nothing about that.

Regards.

In T192473#4163154, aaron wrote:

I'm not so familiar with the Kafka system (only the basic concept).

This. I think that, from the digging I did a few days ago, I now understand the basic concept, but I still have no clue where to look for the problem, although I'd love to help with that. Adding Services; afaik kafka/changeprop is their expertise.

I think @mobrovac knows about Kafka as well, and he helped resolve an issue on the cpjobqueue server last week. Adding him here as well. Thanks.

It seems this task got completely derailed from its original purpose.

The original issue regarding ObjectCache doesn't seem to be related to the job queue at all.

As far as I know, the jobs table has not been used for a long, long time, and I would guess the stuff in there is just a leftover from the previous transition from the DB queue to the Redis queue.

The Kafka queue works correctly overall as far as I can tell.

jobqueue at beta is down again; see https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:GlobalRenameProgress?username=Hauskatze

This link shows that the rename process was correctly finished.

Is there anything left to do here? Can we close this task and file new, more precise tasks for any remaining issues that were discussed here? This one has become impossible to navigate.

The Kafka queue works correctly overall as far as I can tell.

kafka-04 ran out of disk space because the syslog was flooded with errors, so the jobs piled up, and the link you quoted indeed showed that everything got stuck. It might very well be that this was fixed by @Ottomata when he replaced deployment-kafka0[45] with deployment-kafka-main-[12] last week.

EddieGP claimed this task.

This link shows that the rename process was correctly finished.

I unblocked it with a script. I've tested renames today on Beta and they all work now.