|operations/puppet : production||toolsdb: fix up the config for the new server|
Mentioned In:
- T216749: Decommission labsdb1004.eqiad.wmnet and labsdb1005.eqiad.wmnet
- T216507: Move Phragile DB to CloudVPS
- T216441: Evaluate transferring the non-replicated tables to the new toolsdb server
- T216167: Verify checkwiki tool against excessive DB usage
- T216328: CopyPatrol - 500 - Internal Server Error
- T216320: Cannot connect to toolsdb
- T214278: Quickstatements, "backend is overloaded"
- T216213: s52481__stats_global running CREATE DATABASE IF NOT EXISTS on too many queries causing locking issues
- T216170: toolsdb - Per-user connection limits
Mentioned Here:
- T216441: Evaluate transferring the non-replicated tables to the new toolsdb server
- T216170: toolsdb - Per-user connection limits
Hi, thanks for letting us know. The Phragile app is running on its own VPS now, and I'm not sure why we kept the DB on the tools server (it's currently accessed from the VPS instance). I'm trying to move the DB to the VPS so we can hopefully solve that problem and reduce the load.
The problem has returned this morning. Rising user connections are a symptom of all tables becoming unusable, not the cause. WMCS is working to stand up a new server so we can move to it quickly.
Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.
Please keep your comments factual and leave out the attitude. Just subscribe to the cloud-announce list (https://lists.wikimedia.org/pipermail/cloud-announce/2019-February/thread.html) so you know what is happening.
No, I'll not be doing that, any more than you just have. Whilst a serious outage is being treated merely as a curiosity, it is well worth reminding WMF employees that outages like this are far from acceptable.
I can give a guesstimate. Given the complexity of some of the operations we are doing (especially to prevent serious data loss), services probably won't be fully recovered until at least Tuesday next week (2019-02-26).
Wikimedia Cloud Services is not an outage-free zone. Perhaps that should change, but then you really should be advocating for a lot more resources for it. With its current resources the team does an amazing job serving the communities of developers and users.
On the original "no comment" remark: it really shows you are not informing yourself in the least. There have been multiple announcements on the mailing lists, repeated responses in #wikimedia-cloud on freenode, several tasks on Phabricator and a few incident reports on Wikitech. These are all the preferred channels for communication and documentation for Wikimedia Cloud Services matters. You may have missed those, as most users probably have, but that does not give you leave to assume they did not happen or to attack the people and services of the Cloud Services Team.
The team made progress on getting the new server up yesterday and overnight. If we manage to get things live again before Tuesday, we'll get the word out! If not, I think things are moving along well to getting toolsdb back "working" by then. Thanks for your patience everyone!
If I am not mistaken, it's a requirement to at least be subscribed to -announce, if not the primary list itself. You're complaining about an outage of a couple of days (maybe a week total). I remember the days when the database had over a year of lag. Things used to break regularly and that was the norm. As users of the free hosting/services we understood that issues arise. If you cannot deal with issues once a year or so nowadays, you might look at investing in moving your projects to a dedicated hosting service with an SLA and a guaranteed 100% uptime clause. Otherwise, understand that there are communication channels in place and that issues come up.
Can you clarify - did you really mean 2019-02-26, or is the estimate actually for tomorrow (2019-02-19)? Also (if it is the extended period of an additional week) it would be nice to have a brief summary of the steps that are being taken, to explain why it would take so long - are you replacing hardware? Copying and verifying data? Working on improved configuration of some sort?
Repeating from cloud-announce: The change-over was successful. Thank you for your patience and understanding. ToolsDB should now be fully operational. Please report issues to the #wikimedia-cloud IRC channel.
T216441 is to track further recovery efforts around tables that are not replicated.
FYI: I just moved the Phragile database, aka s52552__phragile, to its own instance on CloudVPS, and the app uses that DB now. When I get the OK from my colleagues I will remove it from ToolsDB.
The only things still open under this are hardware that is in the process of decommissioning. I think all of our toolsdb action items from the outage are completed (not to say that there isn't still work to do).