
Job runners throw lots of "Can't connect to MySQL server" exceptions
Closed, Declined · Public

Description

This is a new issue, happening since 2015-12-15 between 1 PM and 7 PM. Traffic has increased on several MySQL servers, in particular the parsercache and s4 (commons) servers, which leads to a higher number of failed connections.

From the initial report:

Also happening about once every few minutes for parsercache servers
Last 3 hours in requests to /rpc/RunJobs.php on mediawiki-errors dashboard in logstash:

12x Error connecting to 10.64.16.157: Can't connect to MySQL server on '10.64.16.157' (4)	 - pc1002
12x Error connecting to 10.64.16.156: Can't connect to MySQL server on '10.64.16.156' (4)	 - pc1001
11x Error connecting to 10.64.16.158: Can't connect to MySQL server on '10.64.16.158' (4)	 - pc1003
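
To compare hosts over time, summary lines like the ones above can be tallied per server. The following is a minimal sketch, assuming the lines keep exactly this `<count>x Error connecting to <ip>: ... (<errno>) - <host>` shape; the names `REPORT`, `LINE_RE`, and `tally_by_host` are hypothetical and not part of any Wikimedia tooling.

```python
import re
from collections import Counter

# Summary lines copied from the report above (tab before the " - host" suffix).
REPORT = """\
12x Error connecting to 10.64.16.157: Can't connect to MySQL server on '10.64.16.157' (4)\t - pc1002
12x Error connecting to 10.64.16.156: Can't connect to MySQL server on '10.64.16.156' (4)\t - pc1001
11x Error connecting to 10.64.16.158: Can't connect to MySQL server on '10.64.16.158' (4)\t - pc1003
"""

# Assumed line shape: "<count>x Error connecting to <ip>: ... (<errno>) - <host>"
LINE_RE = re.compile(
    r"^(?P<count>\d+)x Error connecting to (?P<ip>[\d.]+): "
    r".*\((?P<errno>\d+)\)\s*-\s*(?P<host>\S+)$"
)


def tally_by_host(report: str) -> Counter:
    """Sum the error counts per host name; lines that do not match are skipped."""
    totals = Counter()
    for line in report.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            totals[match.group("host")] += int(match.group("count"))
    return totals


if __name__ == "__main__":
    for host, count in sorted(tally_by_host(REPORT).items()):
        print(host, count)
```

The counts are nearly even across the three parsercache hosts, which is what one would expect if the failures are not specific to a single server.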

Event Timeline

jcrespo created this task. Dec 16 2015, 10:35 AM
jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo added projects: WMF-JobQueue, DBA.
Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 16 2015, 10:35 AM
jcrespo set Security to None.
jcrespo removed a subscriber: Wikimedia-production-error.
aaron added a subscriber: aaron. Dec 17 2015, 12:35 AM

Probably related to T121549.

Are job runners configured with the same 3 second connection timeout as regular HTTP application servers? That would explain the connection issues.
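
To illustrate why a short connect timeout matters here: a server that is merely slow to accept connections under load is reported as a failed connection once the timeout expires. This is only a sketch of a TCP-level probe using the Python standard library, not how MediaWiki connects; the function name `can_connect` is hypothetical, the 3 second default is taken from the question above, and 3306 is the default MySQL port.

```python
import socket


def can_connect(host: str, port: int = 3306, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s seconds.

    A refused connection and a timed-out one both count as failures,
    since both raise OSError (socket.timeout is a subclass of it).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Running the probe from a job runner and from a regular application server with the same `timeout_s` would show whether both see the same failure rate under the same limit.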

Probably related to T122069, or both having the same cause (the start dates match).

I think job runners do not use a separate MediaWiki slave group. They should, so that we can isolate them and give them more or fewer resources as needed. They could also require special partitioning.

Krinkle renamed this task from Spikes of job runner new connection errors to mysql to Job runners throw lots of "Can't connect to MySQL server" exceptions. Mar 24 2016, 7:52 PM
jcrespo closed this task as Declined. Apr 26 2017, 12:45 PM

This has not happened in a long time, but technically nothing was done. Either resolved or not needed anymore.

@jcrespo Hm... I still see them in logstash, though? https://logstash.wikimedia.org/goto/f0ef96a0fa857abcf55b4ff5b9d4f7ee

Both from page views and from job runners. Last 12 hours:

  • en.wikipedia.org: 25,000
  • 127.0.0.1 (job runners): 1,900
  • de.wikipedia.org: 9