
Toolforge: sudden issues in both gridengine and k8s webservices
Closed, Resolved (Public)

Description

Today 2019-08-05 at 09:02 UTC we got 2 pages from toolschecker:

11:02 <+icinga-wm> PROBLEM - toolschecker: gridengine webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/gridengine - 177 bytes in 9.766 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin

11:04 <+icinga-wm> PROBLEM - toolschecker: kubernetes webservice running #page on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 177 bytes in 9.879 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
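For context, both alerts report the same condition: the checker endpoint answered 503 and the body did not contain the string OK. A minimal sketch of that kind of check follows, assuming a plain "status 200 plus expected string" rule. The function names `evaluate_check` and `run_check` are hypothetical, the real check is run by Icinga rather than by this script, and treating every non-200 status as CRITICAL is a simplification.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def evaluate_check(status_code, body, expected="OK"):
    """Return (state, message) for one HTTP response.

    Assumption: anything other than a 200 containing `expected` is CRITICAL.
    """
    if status_code != 200:
        return "CRITICAL", f"HTTP {status_code}"
    if expected not in body:
        return "CRITICAL", f"string {expected} not found"
    return "OK", f"HTTP {status_code}"


def run_check(url, timeout=10.0):
    """Fetch `url` and evaluate it; connection failures count as CRITICAL."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return evaluate_check(resp.status, body)
    except HTTPError as err:
        # urlopen raises on 4xx/5xx, but the status and body are still available.
        body = err.read().decode("utf-8", "replace")
        return evaluate_check(err.code, body)
    except (URLError, OSError) as err:
        return "CRITICAL", f"connection failed: {err}"
```

Calling `run_check("http://checker.tools.wmflabs.org:80/webservice/gridengine")` would exercise the same URL as the first alert. `evaluate_check` is kept separate so the rule can be tried without network access.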

Event Timeline

At first sight both k8s and gridengine webservices look fine. The issue was actually with the toolschecker-related webservices themselves. I'm restarting them by hand to see what happens.

Mentioned in SAL (#wikimedia-cloud) [2019-08-05T09:30:30Z] <arturo> root@tools-checker-03:~# toolscheckerctl restart (T229787)

Mentioned in SAL (#wikimedia-cloud) [2019-08-05T09:39:03Z] <arturo> root@tools-checker-03:~# toolscheckerctl restart again (T229787)

aborrero triaged this task as Medium priority. (Aug 5 2019, 10:00 AM)
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

I could operate both grid and k8s webservices normally. There is no apparent reason for this issue.
Toolschecker didn't like that I managed the webservices on my own, so I had to stop them and restart all the webservices again using toolscheckerctl restart.

Leaving the task open for a bit more time in case anyone has more ideas.

JHedden claimed this task.

The Icinga check description was recently updated for T228878 https://gerrit.wikimedia.org/r/c/operations/puppet/+/525536 . The new name/description for this service appears to have removed the existing acks and downtime.
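To illustrate why a rename could drop acks and downtime, here is a toy model, assuming the monitoring state is stored per (host, service description) pair. The helper `is_silenced` and the description strings are hypothetical placeholders, not the real check names, and this is not how Icinga is actually implemented.

```python
def is_silenced(host, description, silenced):
    """True if an ack or downtime is recorded for this exact (host, description)."""
    return (host, description) in silenced


HOST = "checker.tools.wmflabs.org"

# Hypothetical: an ack was recorded under the old description.
silenced = {(HOST, "old check description")}

# After the rename the service is looked up under its new description,
# so the old entry no longer matches and the check can page again.
before = is_silenced(HOST, "old check description", silenced)
after = is_silenced(HOST, "new check description", silenced)
```

Under this assumption `before` is True and `after` is False, which matches the behaviour described above: the renamed check starts without its previous acks or downtime.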

Closing this ticket based on that information. T221301 is tracking the ongoing work for these checks.