Details
Project | Branch | Lines +/- | Subject |
---|---|---|---|
operations/puppet | production | +1 -0 | toolsdb: fix up the config for the new server |
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2019-02-15T06:40:58Z] <marostegui> Stop puppet on labsdb1005 to leave "max_user_connections" on my.cnf - T216170 T216208
Hi, thanks for letting us know. The Phragile app is running on its own VPS now, and I'm not sure why we kept the DB on the tools server (it's currently accessed from the VPS instance). I'm trying to move the DB to the VPS so we can hopefully solve that problem and reduce the load.
The problem has returned this morning. The rising number of user connections is a symptom of all tables becoming unusable, not the cause. WMCS is working to stand up a new server quickly so that we can move to it.
It is exposing some interesting things and best practices that are missed in places, nonetheless!
Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.
Please keep your comments factual and leave out the attitude. Just subscribe to cloud-announce list ( https://lists.wikimedia.org/pipermail/cloud-announce/2019-February/thread.html ) so you know what is happening.
No, I'll not be doing that, any more than you just have. Whilst a serious outage is being treated merely as a curiosity, it is well worth reminding WMF employees that outages like this are far from acceptable.
I can give a guesstimate. Given the complexity of some of the operations we are doing (especially to prevent serious data loss), services probably won't be fully recovered until at least Tuesday next week (2019-02-26).
Wikimedia Cloud Services is not an outage-free zone. Perhaps that should change, but then you really should be advocating for a lot more resources for it. With its current resources the team does an amazing job serving the communities of developers and users.
On the original "no comment" remark: it really shows you are not informing yourself in the least. There have been multiple announcements on the mailing lists, repeated responses in #wikimedia-cloud on freenode, several tasks on Phabricator and a few incident reports on Wikitech. These are all the preferred channels for communication and documentation for Wikimedia Cloud Services things. You may have missed those, as most users probably have, but that does not give you leave to assume they did not happen or to attack the people and services of the Cloud Services Team.
Quick report:
The team made progress on getting the new server up yesterday and overnight. If we manage to get things live again before Tuesday, we'll get the word out! If not, I think things are moving along well toward getting toolsdb back "working" by then. Thanks for your patience, everyone!
If I am not mistaken, it's a requirement to at least be subscribed to -announce, if not the primary list itself. You're complaining about an outage of a couple of days (maybe a week in total). I remember the days when the database had over a year of lag. Things used to break regularly and that was the norm. As users of the free hosting/services we understood that issues arise. If you cannot deal with issues once a year or so nowadays, you might look at investing in moving your projects to a dedicated hosting service with an SLA and a guaranteed 100% uptime clause. Otherwise, understand that there are communication channels in place and that issues come up.
Change 491004 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: fix up the config for the new server
Change 491004 merged by Bstorm:
[operations/puppet@production] toolsdb: fix up the config for the new server
Can you clarify: did you really mean 2019-02-26, or is the estimate actually for tomorrow (2019-02-19)? Also, if it is the extended period of an additional week, it would be nice to have a brief summary of the steps being taken that explains why it would take so long. Are you replacing hardware? Copying and verifying data? Working on improved configuration of some sort?
Thanks!
Mentioned in SAL (#wikimedia-cloud) [2019-02-18T18:50:21Z] <chicocvenancio> moving paws back to toolsdb T216208
Repeating from cloud-announce: The change-over was successful. Thank you for your patience and understanding. ToolsDB should now be fully operational. Please report issues to the #wikimedia-cloud IRC channel.
T216441 is to track further recovery efforts around tables that are not replicated.
I think it was a typo. Either way, I think we beat the estimate, as long as the problem doesn't come back on the new server. So far, so good.
FYI: I just moved the Phragile database, aka s52552__phragile, to its own instance on CloudVPS, and the app uses that DB now. When I get the OK from my colleagues I will remove it from ToolsDB.
Removing PAWS as it is no longer relevant to this task. Thanks for all the extra effort put into getting this solved quickly.
The only things still open under this task are for hardware that is in the process of being decommissioned. I think all of our toolsdb action items from the outage are completed (not to say that there isn't still work to do).