outreachdashboard is down 2025-02-03
Open, Needs Triage | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

  • Site does not load

What should have happened instead?:
  • Site should load

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):
WMF Production

Event Timeline

Now responding; no notes on what restored it.

Seems to be down entirely right now. @Ragesoss is this something you still hope to maintain?

Yes, I'm very close to deploying a major update intended to get the system into a stable state. In the meantime, I will continue restarting the instance that runs the web server when it goes down (which typically happens because the database server is overloaded because of non-web requests from data updates).

Hi @Ragesoss, we talked a bit about the Dashboard in Connection-Team today. Do you have an estimate of how stable the system will be after your changes? Or alternatively, do you have a timeline for those changes, that I could then compare with the rate of failed updates on our side? Depending on this, we might want to rework the integration on our side so that it's more error-tolerant.
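As a rough illustration of what a "more error-tolerant" integration could mean, here is a minimal sketch of retrying a failed update with capped exponential backoff. It is not the actual integration code: `TransientError`, `call_with_retries`, and all parameter values are hypothetical names and defaults chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. timeouts, HTTP 5xx)."""


def call_with_retries(operation, max_attempts=4, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Run operation(), retrying on TransientError with capped exponential backoff.

    `operation` is any zero-argument callable (assumed here to wrap one
    update request to the Dashboard). `sleep` is injectable so the wait
    can be replaced in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can log or queue the update.
                raise
            # The delay doubles with each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so clients do not all retry at once.
            sleep(delay * random.uniform(0.5, 1.0))
```

A caller would wrap the update request in a function that raises `TransientError` on a timeout or server error and pass it to `call_with_retries`. If the outage outlasts the retries, the exception still propagates, so failed updates would need to be recorded somewhere for a later attempt.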

@Daimona the goal will be to get it pretty much completely stable. The Wiki Education instance (dashboard.wikiedu.org) has had ~100% uptime for more than a year, and I think it's realistic to get very close to that for Programs & Events Dashboard. (The setup on labs is a little more complicated, with system components spread across multiple instances, and is also subject to occasional WM Cloud-wide issues, so it will probably not be quite 100%, but I want to get very close.) This update addresses the root cause of the current instability (which is that updates for very large events are too demanding of the database), but there are likely to be secondary bottlenecks that reveal themselves once we get it into production, so there might be some further work to do to get it really stable. The update will completely eliminate the bottleneck queries against the Dashboard's revisions database, which are the main culprits, so if all goes well, even if large events still cause problems in terms of the update cycle, they should stop taking down the web server as well.

Timeline:

  • Deploying the changes to dashboard.wikiedu.org this week, and monitoring for problems.
  • Deploying to outreachdashboard.wmflabs.org as soon as next week, if everything seems to be okay.

I will start posting updates to T273067 as we begin on Programs & Events Dashboard.

Thank you for the explanation and context :) I'm eager to see how it will play out! I've subscribed to the other task and will keep an eye out on our side. Do let me know if there's anything I can do on the MW side to help you!

Hi all,

My co-host (in a different location, on a different wifi network) and I are both having issues loading the https://outreachdashboard.wmflabs.org/ page. How is the work on getting the dashboard stable progressing? Is this issue intermittent? We are hosting a wiki editathon tomorrow and want to make sure everything goes smoothly.

Edit: it is now back up!