
Outreach Dashboard has been having recurring outages
Open, Needs Triage, Public

Description

Root cause unknown, server restart has been clearing the outages

*T273048
*T252392

500 Internal Server Error for Outreach Dashboard
https://outreachdashboard.wmflabs.org/

Event Timeline

Admins from https://openstack-browser.toolforge.org/project/globaleducation added as subscribers; if this isn't something you can help with, apologies for the extra notifications.

I've been working with Rails performance expert Nate Berkopec this month, and we've come to a clearer understanding of the usual cause of the recurring outages. In most cases (except for the most recent one), the system gets overloaded through the combination of the background processes (which are responsible for updating stats for courses and tie up a significant portion of system resources on an essentially continual basis) and multiple users requesting particularly slow pages (i.e., ones that load a lot of data). The web server uses up to 6 "Passenger" processes, each of which handles a web request by passing it to the Ruby on Rails application. If all 6 of these are tied up with very slow requests, the system can't process any further requests in the meantime.
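To illustrate the failure mode, here is a hypothetical slow action (not actual Dashboard code): because each Passenger process handles one request at a time, any controller action that does heavy synchronous work holds a process for its full duration.

```
# Hypothetical example for illustration only.
class ReportsController < ApplicationController
  def heavy_summary
    # Expensive aggregation performed synchronously inside the web request;
    # the Passenger process serving it can do nothing else until this finishes.
    totals = Course.includes(:revisions).map do |course|
      { course_id: course.id, revisions: course.revisions.size }
    end
    render json: totals
  end
end
```

Six concurrent requests like this are enough to stall the entire pool.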

The manual_update route for doing an immediate stats update for a program was one major source of slow requests, as a course with highly active users or a very long tracking period can take minutes or even hours to update. The manual_update route would attempt an update as a web request instead of a background job, meaning it would tie up one web process until it completed. Although this route isn't accessible via any UI elements, I had told a number of program organizers about it at times when they had an urgent need for updated stats, and word may have spread further; it was being used very regularly. I've now restricted that feature to super admin users.
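For reference, the shape of that restriction looks roughly like the sketch below, along with the further option of queueing the update as a background job so it doesn't occupy a web process at all. Class and helper names here are illustrative assumptions, not the Dashboard's actual code.

```
# Illustrative sketch only; names are hypothetical.
class CoursesController < ApplicationController
  before_action :require_super_admin, only: [:manual_update]

  def manual_update
    course = Course.find(params[:id])
    # Enqueue the stats refresh instead of running it inside the request,
    # freeing the Passenger process immediately.
    CourseUpdateWorker.perform_async(course.id)
    redirect_to course_path(course), notice: 'Update queued.'
  end

  private

  def require_super_admin
    render plain: 'Forbidden', status: :forbidden unless current_user&.super_admin?
  end
end
```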

The main remaining culprits for tying up web processes are the Explore index, which attempts too complex a summary of stats across the whole system and isn't cached aggressively enough, and the Articles and Users lists for campaigns, which load too much data too inefficiently. These should be fixable by reworking the slow pages to load less data at once, query more efficiently, and/or keep cached data for longer.
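Roughly, those two fixes would look like the following sketch: cache the expensive system-wide summary for longer, and paginate the heavy campaign lists rather than loading every record in one request. Again, names and numbers are illustrative, not the actual implementation.

```
# Illustrative only; helper classes and values are placeholders.
class ExploreController < ApplicationController
  def index
    # Serve a cached summary and recompute it at most once per hour,
    # so concurrent visitors don't each trigger the slow aggregation.
    @stats = Rails.cache.fetch('explore_summary_stats', expires_in: 1.hour) do
      CampaignsPresenter.new.summary_stats # hypothetical aggregation helper
    end
  end
end

class CampaignsController < ApplicationController
  def articles
    campaign = Campaign.find_by!(slug: params[:slug])
    # Load a page of articles at a time instead of the full list.
    @articles = campaign.articles.order(:title)
                        .limit(100).offset(params[:offset].to_i)
  end
end
```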

These situations are much more likely to tie up all the processes when many people are using the Dashboard at once, so downtime has often occurred during weekends busy with editathons.

The background jobs and the capacity to provide updated stats for so many resource-intensive programs are the biggest bottleneck in the system, so we're currently prioritizing an architecture change to spread the system across multiple servers. With one server for web requests, another for the database, and another for background jobs, it should become far more resilient to periods of heavy simultaneous use.
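In practice, that split mostly amounts to pointing each role at shared external services instead of localhost. Assuming a Sidekiq-style job queue backed by Redis (hostnames below are placeholders, not the planned deployment details), a minimal sketch looks like:

```
# config/initializers/sidekiq.rb -- placeholder hostnames, for illustration.
# The web server enqueues jobs; a separate jobs server runs the workers; both
# point at the same Redis instance. config/database.yml would likewise point
# at the dedicated database server rather than localhost.
Sidekiq.configure_server do |config|
  config.redis = { url: 'redis://jobs-queue.dashboard.internal:6379/0' }
end

Sidekiq.configure_client do |config|
  config.redis = { url: 'redis://jobs-queue.dashboard.internal:6379/0' }
end
```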