
Various user visible errors in Cloud VPS projects following OpenStack upgrade on 2019-10-07
Open, Normal, Public

Description

The cloud-services-team planned, announced, and executed an upgrade of OpenStack components from the "Mitaka" version to the "Newton" version on 2019-10-07. After upgrading the software and related database schemas, various issues were found with instances in Cloud VPS projects. This ticket is tracking some of those issues, but is not the primary location that the Cloud Services SRE team is using to coordinate investigation and correction of the issues.


Original report:

500 internal server error on tools.wmflabs.org and all CI is dead

All tools.wmflabs.org tools are currently down with web services returning HTTP error 500

I have not seen any alerts on wikitech saying this is planned, so I am filing this task.

Event Timeline

RhinosF1 created this task. Mon, Oct 7, 4:20 PM
Restricted Application added a subscriber: Aklapper. Mon, Oct 7, 4:20 PM
RhinosF1 triaged this task as Unbreak Now! priority. Mon, Oct 7, 4:21 PM

Boldly setting to UBN!

Restricted Application added a subscriber: Liuxinyu970226. Mon, Oct 7, 4:21 PM
bd808 claimed this task. Mon, Oct 7, 4:32 PM
bd808 added a subscriber: bd808.

We are actively working on this. The OpenStack upgrade we planned for this morning has resulted in some networking issues that we are still attempting to work through.

Jdforrester-WMF renamed this task from 500 internal server error on tools.wmflabs.org to 500 internal server error on tools.wmflabs.org and all CI is dead. Mon, Oct 7, 4:36 PM
RhinosF1 updated the task description. Mon, Oct 7, 4:38 PM
bd808 added a comment. Mon, Oct 7, 4:38 PM

It appears that some (most?) tools were actually working, but the 'admin' tool that serves up both https://tools.wmflabs.org/ and https://tools.wmflabs.org/admin/ was not working.

aborrero lowered the priority of this task from Unbreak Now! to High. Mon, Oct 7, 4:46 PM

It appears that some (most?) tools were actually working, but the 'admin' tool that serves up both https://tools.wmflabs.org/ and https://tools.wmflabs.org/admin/ was not working.

https://tools.wmflabs.org/versions/, wm-bot, and Zppix-Bot were also down when I filed this.

Only wm-bot is back as of now.

tools.wmflabs.org/ is up though for me so that doesn't make sense

Confirming that CI is now fixed. Thanks!

bd808 added a comment. Mon, Oct 7, 5:06 PM

Various issues we have seen/are working on:

  • Neutron (software-defined networking for OpenStack instances) was routing traffic between Cloud VPS instances and bare metal hosts (internal Wikimedia Foundation network) using a different source IP than expected after the software updates
    • The different source IP caused a lot of traffic to be rejected, because the local firewall settings for various services did not allow the new IP
    • Restarting the Neutron services seems to have corrected this and started routing traffic via the expected IP
  • Communications with DNS recursors, LDAP directory servers, and NFS servers were all affected by the Neutron IP issue
  • Rolling restarts of instances in Toolforge are happening to correct the NFS mounting issues.
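
For illustration only, the source IP problem described in the list above can be sketched as a simple allowlist check. This is not the actual firewall configuration, which is not shown in this ticket: the function name `is_source_allowed` and all addresses (taken from the RFC 5737 documentation ranges) are hypothetical.

```python
import ipaddress


def is_source_allowed(source_ip, allowed_networks):
    """Return True if source_ip falls inside any of the allowed networks."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_networks)


# Hypothetical values: a service that only accepts traffic from one
# expected source address, as a host firewall rule might.
ALLOWED = ["192.0.2.10/32"]
EXPECTED_SOURCE = "192.0.2.10"      # address the service was configured for
UNEXPECTED_SOURCE = "198.51.100.7"  # address traffic arrived from instead

print(is_source_allowed(EXPECTED_SOURCE, ALLOWED))
print(is_source_allowed(UNEXPECTED_SOURCE, ALLOWED))
```

Under these assumptions, traffic arriving from the unexpected address fails the check even though the sending instance is unchanged, which would be consistent with DNS, LDAP, and NFS all being affected at once.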
Zppix added a subscriber: Zppix. Mon, Oct 7, 5:27 PM

Noting that all tools that I've seen as down are now back (one needed its processes killed by a maintainer so they could restart).

bd808 renamed this task from 500 internal server error on tools.wmflabs.org and all CI is dead to Various user visible errors in Cloud VPS projects following OpenStack upgrade on 2019-10-07. Mon, Oct 7, 5:30 PM
bd808 updated the task description.
RhinosF1 updated the task description. Mon, Oct 7, 5:32 PM
SQL added a subscriber: SQL. Mon, Oct 7, 5:37 PM
Krenair added a subscriber: Krenair. Mon, Oct 7, 6:49 PM
bd808 removed bd808 as the assignee of this task. Mon, Oct 7, 8:31 PM
bd808 lowered the priority of this task from High to Normal.
bd808 moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board.

Core infrastructure appears to be working as intended. Known instances with NFS mount failures have been rebooted. Toolforge seems to be working normally. Some tools may still need individual restarts, but we tried to force a restart of Kubernetes pods and other long running jobs that should have fixed many of them.

This incident will need some documentation follow-up and analysis of ways that we could try to avoid unintended breakage in the future. I am going to remove myself as assignee for now, but also place this ticket in a queue for discussion at the next WMCS team meeting.

bd808 assigned this task to Andrew. Mon, Oct 7, 11:29 PM
bd808 added a subscriber: Andrew.