
deployment-prep has jobqueue issues
Closed, Resolved, Public

Description

Millions (literally) of error messages like Notice: Undefined index: channels in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 340 since about 2018-04-17/18 midnight UTC, which is apparently breaking T192471 and another global rename I was asked to perform today. Thanks.

https://logstash-beta.wmflabs.org/goto/d4ac7b432beed9503464aa12e97930e1

Event Timeline

Also: Warning: Invalid argument supplied for foreach() in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 343

MarcoAurelio updated the task description.
MarcoAurelio added subscribers: Rxy, hoo.

@hoo and @Rxy suggest this may be nutcracker refusing to stay up and running. Indeed, when I run sudo puppet agent -tv it stays up for a minute or two, then goes down again.

hoo@deployment-jobrunner03:~$ sudo tail /var/log/nutcracker/nutcracker.log
[2018-04-18 21:18:53.606] nc_proxy.c:148 bind on p 11 to addr '/var/run/nutcracker/nutcracker.sock 0666' failed: No such file or directory
…
MarcoAurelio renamed this task from Notice: Undefined index: channels in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 340 to redis/nutcracker down on deployment-prep. Apr 18 2018, 9:44 PM
MarcoAurelio updated the task description.

nutcracker now starts on deployment-jobrunner03; see my comment on T178457. The error in the task description still happens, though.

That patch changed $wgMainWANCache. Judging by both its timing and its content, I think it is very likely the reason we see tons of Notice: Undefined index: channels in /srv/mediawiki/php-master/includes/objectcache/ObjectCache.php on line 340, which in turn is probably the reason the job queue is broken.
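
For context, the notice is consistent with a WAN cache definition that is missing its 'channels' key. A rough sketch of the relevant configuration shape follows; the cache name and values below are made up, this is not the actual beta config:

// Hypothetical excerpt, not the real deployment-prep settings.
// $wgMainWANCache selects an entry from $wgWANObjectCaches.
$wgMainWANCache = 'main-wan';
$wgWANObjectCaches['main-wan'] = [
	'class'    => WANObjectCache::class,
	'cacheId'  => CACHE_MEMCACHED,
	// Without this 'channels' key, ObjectCache::newWANCacheFromParams()
	// first triggers "Undefined index: channels" (line 340) and then
	// "Invalid argument supplied for foreach()" (line 343).
	'channels' => [ 'purge' => 'wancache-main-purge' ],
];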

EddieGP renamed this task from redis/nutcracker down on deployment-prep to deployment-prep has jobqueue/caching issues. Apr 19 2018, 12:17 AM
EddieGP updated the task description.

Change 427590 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Avoid warnings in ObjectCache::newWANCacheFromParams if "channels" is not set"

https://gerrit.wikimedia.org/r/427590

The warnings are pointless; the patch above adds an isset() check.
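
Something along these lines, as a sketch of the kind of guard the patch adds (not the literal diff; see Gerrit change 427590 for the real one):

// Sketch only: skip channel setup entirely when 'channels' is not configured.
$channels = isset( $params['channels'] ) ? $params['channels'] : [];
foreach ( $channels as $action => $channel ) {
	// Set up the purge relayer for this channel as before; with no
	// 'channels' configured the loop simply does nothing.
}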

EddieGP lowered the priority of this task from Unbreak Now! to Low. Apr 19 2018, 8:24 AM

Per Aaron's comment, this is just logspam. It seems the actual problem for the renames was nutcracker, which I fixed in T192473#4141169. Both are now unstuck.

Change 427590 merged by jenkins-bot:
[mediawiki/core@master] Avoid warnings in ObjectCache::newWANCacheFromParams if "channels" is not set"

https://gerrit.wikimedia.org/r/427590

The error mentioned is gone, thanks. However, we still have issues: T192604.

EddieGP raised the priority of this task from Low to High. Apr 19 2018, 11:16 PM

Indeed, the jobqueue on beta is still broken, although it's unrelated to the logspam.
Thanks to Antoine, I now know that jobs are queued in kafka. Fiddling a bit on deployment-kafka04 gives me the impression that all the jobs made it there and are sitting happily in the queue; somehow they just don't get consumed. From the description of profile::cpjobqueue, I gather that deployment-cpjobqueue consumes the messages from kafka and triggers the jobs on deployment-jobrunner03. There are some errors from deployment-cpjobqueue in logstash; I might take a look at those tomorrow, or rather later today in my TZ.

EddieGP renamed this task from deployment-prep has jobqueue/caching issues to deployment-prep has jobqueue issues. Apr 20 2018, 10:00 PM

@EddieGP Puppet on deployment-cpjobqueue was broken due to a full disk; this was fixed today at T193127. However, since March 20 logstash-beta has not been recording any messages, which is being tracked at T193133, and of course global renames and all tasks depending on the job queue are still not running / broken :-(

Who can we approach to have the jobqueue issues resolved, @greg?

Thanks.

Also, there are stalled jobs in the job table:

wikiadmin@deployment-db04[deploymentwiki]> select count(*) from job;
+----------+
| count(*) |
+----------+
|       31 |
+----------+
1 row in set (0.00 sec)
wikiadmin@deployment-db04[deploymentwiki]> select job_id, job_cmd, job_timestamp from job;
+--------+---------------+----------------+
| job_id | job_cmd       | job_timestamp  |
+--------+---------------+----------------+
|      1 | refreshLinks2 | 20121102235108 |
|      2 | enotifNotify  | 20121105052726 |
|      3 | enotifNotify  | 20121121052728 |
|      4 | enotifNotify  | 20121202013726 |
|      5 | refreshLinks2 | 20121227143406 |
|      6 | refreshLinks2 | 20121227143638 |
|      7 | refreshLinks2 | 20121227143816 |
|      8 | refreshLinks2 | 20121227143908 |
|      9 | refreshLinks2 | 20121227143945 |
|     10 | refreshLinks2 | 20121227152308 |
|     11 | refreshLinks2 | 20121227152442 |
|     12 | refreshLinks2 | 20121227152521 |
|     13 | refreshLinks2 | 20121227152551 |
|     14 | refreshLinks2 | 20121227152639 |
|     15 | refreshLinks2 | 20121227152749 |
|     16 | refreshLinks2 | 20121227152849 |
|     17 | refreshLinks2 | 20121227152935 |
|     18 | refreshLinks2 | 20121227152959 |
|     19 | refreshLinks2 | 20121227153406 |
|     20 | refreshLinks2 | 20121227153436 |
|     21 | refreshLinks2 | 20121227154227 |
|     22 | refreshLinks2 | 20121227154257 |
|     23 | refreshLinks2 | 20121227154307 |
|     24 | refreshLinks2 | 20121227154335 |
|     25 | refreshLinks2 | 20121227154401 |
|     26 | refreshLinks2 | 20121227154423 |
|     27 | refreshLinks2 | 20121227154509 |
|     28 | refreshLinks2 | 20121227160211 |
|     29 | refreshLinks2 | 20121227160303 |
|     30 | enotifNotify  | 20130226015129 |
|     31 | enotifNotify  | 20130409222941 |
+--------+---------------+----------------+
31 rows in set (0.00 sec)

I don't think years-old jobs should be there as per https://www.mediawiki.org/wiki/Manual:Job_table

Once upon a time, appservers would insert jobs into a database table and jobrunner servers would read 'em from there.

But then came the great rise of redis (or was there another step in between?): appservers would push jobs into redis storage, and a script called 'jobrunner' would pull them and execute them on servers called jobrunners. When switching away from db-based storage for jobs, some old jobs were apparently just left behind. At this point I can hardly imagine anyone still cares about them; the table might as well be truncated.

Now there is kafka: app servers (like deployment-mediawiki-07) insert jobs into 'topics' there, and another host (deployment-cpjobqueue) fetches the job descriptions from these 'topics' and creates an HTTP request for each of them. That HTTP request triggers the execution of the job on the servers called 'jobrunner' (deployment-jobrunner02), while the script called 'jobrunner' is no longer involved at all.

When you execute runJobs.php or showJobs.php you don't see anything, because those scripts cannot query the queue, neither for jobs nor for the queue size. MediaWiki's kafka storage for the jobqueue is push-only ( https://wikitech.wikimedia.org/wiki/Kafka_Job_Queue ), which means that MediaWiki itself never reads the queue and only executes jobs when commanded to do so by HTTP requests.

Or at least, that is my understanding from the configuration and the descriptions I've seen so far, which means it might as well be totally off. Until recently I didn't even know that we're now using kafka; I assumed it was redis and the 'jobrunner' script, because nothing in [[wikitech:job queue]], [[wikitech:jobrunner]] or [[wikitech:redis]] mentioned it (I've now partially changed that, but edits would be welcome).
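
If that push-only description is right, the MediaWiki side can be pictured as a queue object that can only write, never read. An illustrative PHP sketch follows; this is not the actual Kafka-backed queue class, and all names in it are made up:

// Illustrative only: a push-only queue in the spirit of what is described
// above. MediaWiki writes job events; consumption happens elsewhere
// (cpjobqueue turning queue messages into HTTP calls to the jobrunners).
class PushOnlyJobQueue {
	/** @var callable Sends a serialized job description to an external broker. */
	private $producer;

	public function __construct( callable $producer ) {
		$this->producer = $producer;
	}

	public function push( array $jobDescription ) {
		( $this->producer )( json_encode( $jobDescription ) );
	}

	public function getSize() {
		// This is why showJobs.php has nothing to report: the queue size
		// lives in the external broker, not anywhere MediaWiki itself can query.
		throw new LogicException( 'Queue size is not readable from MediaWiki.' );
	}
}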

refreshLinks2 is not used anymore. Since it is no longer in $wgJobClasses, those rows probably won't get cleared by recycleAndDeleteStaleJobs().

(That is, if JobQueueDB was used to begin with.)
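
For context, $wgJobClasses is the map from the job_cmd value stored in the job table to the Job class that runs it. A short sketch of why the leftover rows never get picked up, assuming the standard enotifNotify mapping:

// $wgJobClasses maps job_cmd names to runnable Job classes, e.g.:
$wgJobClasses['enotifNotify'] = EnotifNotifyJob::class;
// There is no longer any 'refreshLinks2' entry, so no queue object is ever
// created for that job type and nothing (including the cleanup in
// recycleAndDeleteStaleJobs()) ever looks at those rows again.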

I'm not so familiar with the Kafka system (only the basic concept).

@EddieGP @aaron Thanks for the information. Still, the job queue is, as far as I can see, totally broken for all sites; that's not okay, and I'd dare to say it prevents the Beta Cluster from being usable as a production-like environment for testing and developing software. Who can we beg to have a look at this? GlobalRename is broken there, to give an example. Adding RelEng here, as deployment-prep is their environment.

Regarding the job table and its truncation: I wouldn't touch it myself. I'd say we should do the same for all the other deployment-prep wikis, and of course I assume there are replicas of those tables that would need to be cleaned too. I know nothing about that.

Regards.

In T192473#4163154, aaron wrote:

I'm not so familiar with the Kafka system (only the basic concept).

This. I think that, from the digging I did a few days ago, I now understand the basic concept, but I still have no clue where to look for the problem, although I'd love to help with that. Adding Services; afaik kafka/changeprop is their expertise.

I think @mobrovac knows about Kafka as well, and he helped resolve an issue on the cpjobqueue server last week. Adding him here as well. Thanks.

It seems this task got completely derailed from its original purpose.

The original issue regarding ObjectCache doesn't seem to be related to the job queue at all.

As far as I know, the jobs table has not been used for a long, long time, and I would guess the stuff in there is just a leftover from the previous transition from the DB queue to the Redis queue.

The Kafka queue works correctly overall as far as I can tell.

jobqueue at beta is down again; see https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:GlobalRenameProgress?username=Hauskatze

This link shows that the rename process was correctly finished.

Is there anything left to do here? Can we close this task and file new, more precise tasks for any remaining issues that were discussed here? This one has become impossible to navigate.

The Kafka queue works correctly overall as far as I can tell.

kafka-04 ran out of disk space because the syslog was flooded with errors, so the jobs piled up, and the link you quoted indeed showed that everything got stuck. It might very well be that this was fixed by @Ottomata when he replaced deployment-kafka0[45] with deployment-kafka-main-[12] last week.

EddieGP claimed this task.

This link shows that the rename process was correctly finished.

I unblocked it with a script. I've tested renames today on Beta and they all work now.