
Outreach Dashboard has been having recurring outages
Open, Needs Triage, Public

Description

Root cause unknown, server restart has been clearing the outages

*T273048
*T252392

500 Internal Server Error for Outreach Dashboard
https://outreachdashboard.wmflabs.org/

Event Timeline

Admins from https://openstack-browser.toolforge.org/project/globaleducation added as subscribers; if this isn't something you can help with, apologies for the extra notifications.

I've been working with Rails performance expert Nate Berkopec this month, and we've come to a clearer understanding of the usual cause of the recurring outages. In most cases (except for the most recent one), the system gets overloaded through the combination of the background processes (which are responsible for updating stats for courses and tie up a significant portion of system resources on an essentially continual basis) and multiple users requesting particularly slow pages (i.e., ones that load a lot of data). The web server uses up to 6 "Passenger" processes, each of which handles a web request by passing it to the Ruby on Rails application. If all 6 of these are tied up with very slow requests, the system can't process any further requests in the meantime.
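To illustrate the failure mode, here is a hypothetical slow action (not actual Dashboard code): because each Passenger process handles one request at a time, any controller action that does heavy synchronous work holds a process for its full duration.

```
# Hypothetical example for illustration only.
class ReportsController < ApplicationController
  def heavy_summary
    # Expensive aggregation performed synchronously inside the web request;
    # the Passenger process serving it can do nothing else until this finishes.
    totals = Course.includes(:revisions).map do |course|
      { course_id: course.id, revisions: course.revisions.size }
    end
    render json: totals
  end
end
```

Six concurrent requests like this are enough to stall the entire pool.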

The manual_update route for doing an immediate stats update for a program was one major source of slow requests, as a course with highly active users or a very long tracking period can take minutes or even hours to update. The manual_update route would attempt an update as a web request instead of a background job, meaning it would tie up one web process until it completed. Although this route isn't accessible via any UI elements, I had told a number of program organizers about it at times when they had an urgent need for updated stats, and word may have spread further; it was being used very regularly. I've now restricted that feature to super admin users.
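For reference, the shape of that restriction looks roughly like the sketch below, along with the further option of queueing the update as a background job so it doesn't occupy a web process at all. Class and helper names here are illustrative assumptions, not the Dashboard's actual code.

```
# Illustrative sketch only; names are hypothetical.
class CoursesController < ApplicationController
  before_action :require_super_admin, only: [:manual_update]

  def manual_update
    course = Course.find(params[:id])
    # Enqueue the stats refresh instead of running it inside the request,
    # freeing the Passenger process immediately.
    CourseUpdateWorker.perform_async(course.id)
    redirect_to course_path(course), notice: 'Update queued.'
  end

  private

  def require_super_admin
    render plain: 'Forbidden', status: :forbidden unless current_user&.super_admin?
  end
end
```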

The main remaining culprits for tying up web processes are the Explore index, which attempts too complex a summary of stats across the whole system and isn't cached aggressively enough, and the Articles and Users lists for campaigns, which load too much data too inefficiently. These should be fixable by reworking the slow pages to load less data at once, query more efficiently, and/or keep cached data for longer.
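Roughly, those two fixes would look like the following sketch: cache the expensive system-wide summary for longer, and paginate the heavy campaign lists rather than loading every record in one request. Again, names and numbers are illustrative, not the actual implementation.

```
# Illustrative only; helper classes and values are placeholders.
class ExploreController < ApplicationController
  def index
    # Serve a cached summary and recompute it at most once per hour,
    # so concurrent visitors don't each trigger the slow aggregation.
    @stats = Rails.cache.fetch('explore_summary_stats', expires_in: 1.hour) do
      CampaignsPresenter.new.summary_stats # hypothetical aggregation helper
    end
  end
end

class CampaignsController < ApplicationController
  def articles
    campaign = Campaign.find_by!(slug: params[:slug])
    # Load a page of articles at a time instead of the full list.
    @articles = campaign.articles.order(:title)
                        .limit(100).offset(params[:offset].to_i)
  end
end
```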

These situations are much more likely to tie up all the processes when many people are using the Dashboard at once, so downtime has often occurred during weekends busy with editathons.

The background jobs and the capacity to provide updated stats for so many resource-intensive programs are the biggest bottleneck in the system, so we're currently prioritizing an architecture change to spread the system across multiple servers. With one server for web requests, another for the database, and another for background jobs, it should become far more resilient to periods of heavy simultaneous use.
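In practice, that split mostly amounts to pointing each role at shared external services instead of localhost. Assuming a Sidekiq-style job queue backed by Redis (hostnames below are placeholders, not the planned deployment details), a minimal sketch looks like:

```
# config/initializers/sidekiq.rb -- placeholder hostnames, for illustration.
# The web server enqueues jobs; a separate jobs server runs the workers; both
# point at the same Redis instance. config/database.yml would likewise point
# at the dedicated database server rather than localhost.
Sidekiq.configure_server do |config|
  config.redis = { url: 'redis://jobs-queue.dashboard.internal:6379/0' }
end

Sidekiq.configure_client do |config|
  config.redis = { url: 'redis://jobs-queue.dashboard.internal:6379/0' }
end
```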