Investigate Tool Labs webservice outage on 2016-05-25
Closed, Resolved · Public

Description

Loading any page on the Tool Labs domain results in: "503 Service Temporarily Unavailable"

As far as I can tell this is just Tool Labs and not other parts of Wikimedia Labs.

Event Timeline

Harej created this task. May 25 2016, 5:06 AM
Restricted Application added a project: Cloud-Services. May 25 2016, 5:06 AM
Restricted Application added subscribers: Zppix, Aklapper.
Harej triaged this task as Unbreak Now! priority. May 25 2016, 5:06 AM
Restricted Application added subscribers: Luke081515, TerraCodes, Urbanecm. May 25 2016, 5:06 AM
Harej added a comment. May 25 2016, 5:30 AM

To clarify, this is not limited to one tool. Every tool I try to load returns the same error.

URLs tested:

https://tools.wmflabs.org/urbanecmbot/reliktyCswiki/ wasn't working a few seconds ago; after I restarted it with webservice restart (SSH is working), it works. My second tool (https://tools.wmflabs.org/missingpages) also wasn't working before the restart and works now.

@Labsadmins: Please restart all webservices; I think that should fix it.

Urbanecm lowered the priority of this task from Unbreak Now! to High. May 25 2016, 6:55 AM

Lowering the priority because Tool Labs is working again, so this task is now about finding out why Tool Labs wasn't accessible.

yuvipanda renamed this task from "Tool Labs appears to be down" to "Investigate Tool Labs webservice outage on 2016-05-25". May 25 2016, 7:00 AM

Change 290681 had a related patch set uploaded (by Rush):
tools.checker continually watch for webservices

https://gerrit.wikimedia.org/r/290681

See https://etherpad.wikimedia.org/p/tools-web-outage-2016-05-25 for some ad-hoc notes on what happened.

Thanks for responding and following up. I added an action item to the etherpad about the stuck mount that was a contributing factor. Don't forget to create an incident project and the associated incident page. It seems that, in addition to having the puppet nag emails active for tools, we should watch the webservices portion of Tools more closely. Check logic for this now exists, but it is unfortunately failing at the moment:

http://checker.tools.wmflabs.org/service/start
NOT OK

What do you think about fixing that and merging https://gerrit.wikimedia.org/r/#/c/290681/?
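As a rough illustration of what continually watching such a check could involve, here is a minimal Python sketch. It is not the tools.checker implementation from the Gerrit change. It assumes only what the output above shows, namely that a failing check returns a body of "NOT OK", and it further assumes that a healthy check returns exactly "OK". The names is_ok, failing_checks and http_fetch are hypothetical.

```python
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def is_ok(body: str) -> bool:
    """Assumption: a healthy check answers with a body of exactly 'OK'."""
    return body.strip() == "OK"


def http_fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a check URL; HTTP errors such as 503 raise an exception."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def failing_checks(
    urls: Iterable[str], fetch: Callable[[str], str] = http_fetch
) -> Dict[str, str]:
    """Return {url: reason} for every check that is not healthy."""
    failures: Dict[str, str] = {}
    for url in urls:
        try:
            body = fetch(url)
        except Exception as exc:  # an unreachable check counts as failing
            failures[url] = f"fetch error: {exc}"
            continue
        if not is_ok(body):
            failures[url] = body.strip()
    return failures
```

Running failing_checks on a schedule (for example from cron or a monitoring system) and alerting whenever the result is non-empty would give the kind of continuous watch discussed here. The fetch parameter is injectable so the logic can be exercised without network access.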

scfc added a subscriber: scfc. May 26 2016, 12:25 AM

I don't think this deserves major action, and T136168 can't prevent it. If software needs to be updated on multiple hosts, then regardless of the deployment method there will be a time frame in which host A runs version X and host B runs version X + 1. Therefore, whether it's MediaWiki or some other application, good practice is to make version X + 1 backwards-compatible with version X, deploy X + 1, and, once that is running everywhere, disable or drop the compatibility mode so that new features can be used.

We've used that process in the past, here we missed it once, so let's make a mental note and move on.
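The rollout order described above can be sketched in Python as follows. This is a generic illustration, not the software involved in this outage: the message formats, the version numbers and the function names are all hypothetical.

```python
from typing import Dict, Iterable

OLD, NEW = 1, 2  # hypothetical stand-ins for version X and version X + 1


def read_message(msg: Dict) -> str:
    """Version X + 1 reader: still accepts the version X format."""
    if msg.get("version", OLD) == OLD:
        return msg["name"]
    return msg["tool"]["name"]


def write_message(name: str, host_versions: Iterable[int]) -> Dict:
    """Emit the new format only once every known host runs X + 1."""
    versions = list(host_versions)
    if versions and all(v >= NEW for v in versions):
        return {"version": NEW, "tool": {"name": name}}
    # Compatibility mode: keep writing what version X hosts understand.
    return {"version": OLD, "name": name}
```

While hosts are mixed, write_message keeps producing the old format, so hosts still on version X continue to work. Once every host reports X + 1, the new format can be used, and the OLD branches can later be removed.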

valhallasw moved this task from Triage to Backlog on the Toolforge board. May 27 2016, 11:08 AM

Change 290681 merged by Rush:
tools.checker continually watch for webservices

https://gerrit.wikimedia.org/r/290681

yuvipanda closed this task as Resolved. Jul 5 2016, 12:17 PM
yuvipanda claimed this task.