
ToolsDB overload and cleanup
Closed, Resolved (Public)


Event Timeline


Mentioned in SAL (#wikimedia-operations) [2019-02-15T06:39:35Z] <marostegui> Restart labsdb1005 with max_user_connections = 20 T216208

Marostegui added a comment (edited). Feb 15 2019, 6:39 AM

I have restarted the server with max_user_connections = 20 to try to mitigate this; the server was unusable anyway.

Mentioned in SAL (#wikimedia-operations) [2019-02-15T06:40:58Z] <marostegui> Stop puppet on labsdb1005 to leave "max_user_connections" on my.cnf - T216170 T216208
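The mitigation described above (capping per-user connections, then stopping Puppet so the change survives in my.cnf) can be sketched as a configuration fragment. The file path and section layout are illustrative, not taken from labsdb1005 itself:

```ini
# /etc/mysql/my.cnf (fragment) -- illustrative layout.
# Caps each account at 20 simultaneous connections so one runaway
# tool cannot exhaust the server's connection pool.
[mysqld]
max_user_connections = 20
```

With Puppet stopped, the hand-edited value persists until configuration management is re-enabled and reconciles the file.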

All the ones from s52552 are:

Id	User	Host	db	Command	Time	State	Info	Progress
75697	s52552	xx	s52552__phragile	Prepare	14024	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000
75739	s52552	xx	s52552__phragile	Prepare	14001	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000
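The per-user triage shown above (grouping SHOW PROCESSLIST rows by account to spot whoever is hogging connections) can be sketched in a few lines. The rows below are hypothetical stand-ins shaped like MariaDB processlist output; only the two s52552 entries echo the task:

```python
from collections import Counter

# Rows shaped like SHOW PROCESSLIST output:
# (Id, User, Host, db, Command, Time, State, Info).
# The s99999 row is a made-up filler entry for contrast.
PROCESSLIST = [
    (75697, "s52552", "xx", "s52552__phragile", "Prepare", 14024,
     "Opening tables", "select * from `sprints` where `sprints`.`id` = ? limit 1"),
    (75739, "s52552", "xx", "s52552__phragile", "Prepare", 14001,
     "Opening tables", "select * from `sprints` where `sprints`.`id` = ? limit 1"),
    (75801, "s99999", "yy", "s99999__demo", "Query", 3,
     "Sending data", "select 1"),
]

def connections_per_user(rows):
    """Count open connections per account, the first thing to check in an overload."""
    return Counter(row[1] for row in rows)

def over_limit(rows, max_user_connections=20):
    """Return accounts whose connection count exceeds the configured cap."""
    counts = connections_per_user(rows)
    return {user: n for user, n in counts.items() if n > max_user_connections}

print(connections_per_user(PROCESSLIST))  # s52552 has 2 open connections here
```

On a live server the same aggregation is usually done in SQL against information_schema.processlist; this sketch just shows the bookkeeping.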

Hi, thanks for letting us know. The Phragile app is running on its own VPS now, and I'm not sure why we kept the DB on the tools server (it's currently accessed from the VPS instance). I'm trying to move the DB to the VPS so we can hopefully solve that problem and reduce the load.

The problem has returned this morning. Rising user connections are a symptom of all tables becoming unusable, not the cause. WMCS is proceeding to try to stand up a new server to move to quickly.

Nonetheless, it is exposing some interesting things, and some best practices that were missed in places!

Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

Ayack added a subscriber: Ayack. Feb 16 2019, 10:49 AM
Sic19 added a subscriber: Sic19. Feb 16 2019, 11:23 AM
Daniel_Mietchen triaged this task as High priority. Feb 16 2019, 12:22 PM

> Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

Please keep your comments factual and leave out the attitude. Just subscribe to the cloud-announce list (https://lists.wikimedia.org/pipermail/cloud-announce/2019-February/thread.html) so you know what is happening.

No, I'll not be doing that, any more than you just have. Whilst a serious outage is being treated merely as a curiosity, it is well worth reminding WMF employees that outages like this are far from acceptable.

> Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

I can give a guesstimate. Given the complexity of some of the operations we are doing (especially to prevent serious data loss), services probably won't be fully recovered until at least Tuesday next week (2019-02-26).

> No, I'll not be doing that, any more than you just have. Whilst a serious outage is being treated merely as a curiosity, it is well worth reminding WMF employees that outages like this are far from acceptable.

Wikimedia Cloud Services is not an outage-free zone. Perhaps that should change, but then you really should be advocating for a lot more resources for it. With its current resources the team does an amazing job serving the communities of developers and users.

On the original "no comment" remark: it really shows you are not informing yourself in the least. There have been multiple announcements on the mailing lists, repeated responses in #wikimedia-cloud on Freenode, several tasks on Phabricator, and a few incident reports on Wikitech, all of which are the preferred channels for communication and documentation for Wikimedia Cloud Services. You may have missed those, as most users probably have, but that does not give you leave to assume they did not happen or to attack the people and services of the Cloud Services team.

Quick report:
The team made progress on getting the new server up yesterday and overnight. If we manage to get things live again before Tuesday, we'll get the word out! If not, I think things are moving along well towards getting ToolsDB back "working" by then. Thanks for your patience, everyone!

> Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

If I am not mistaken, it's a requirement to at least be subscribed to -announce, if not the primary list itself. You're complaining about a couple of days (maybe a week total) of outage. I remember the days when the database had over a year of lag. Things used to break regularly, and that was the norm. As users of the free hosting/services, we understood that issues arise. If you cannot deal with issues once a year or so nowadays, you might look at investing in moving your projects to a dedicated hosting service with an SLA and a guaranteed 100% uptime clause. Otherwise, understand that there are communication channels in place and that issues come up.

Also, you know, you're running outside of production.

Change 491004 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: fix up the config for the new server

https://gerrit.wikimedia.org/r/491004

Change 491004 merged by Bstorm:
[operations/puppet@production] toolsdb: fix up the config for the new server

https://gerrit.wikimedia.org/r/491004

Ghuron added a subscriber: Ghuron. Feb 17 2019, 4:35 AM
Kpjas added a subscriber: Kpjas. Feb 17 2019, 8:24 AM
Az1568 added a subscriber: Az1568. Feb 18 2019, 1:04 AM
jrbs added a subscriber: jrbs. Feb 18 2019, 1:49 AM

> I can give a guesstimate. Given the complexity of some of the operations we are doing (especially to prevent serious data loss), services probably won't be fully recovered until at least Tuesday next week (2019-02-26).

Can you clarify: did you really mean 2019-02-26, or is the estimate actually for tomorrow (2019-02-19)? Also (if it is the extended period of an additional week), it would be nice to have a brief summary of the steps that are being taken that explains why it would take so long. Are you replacing hardware? Copying and verifying data? Working on an improved configuration of some sort?

Thanks!

Euku added a subscriber: Euku. Feb 18 2019, 6:13 PM

Mentioned in SAL (#wikimedia-cloud) [2019-02-18T18:50:21Z] <chicocvenancio> moving paws back to toolsdb T216208

Repeating from cloud-announce: The change-over was successful. Thank you for your patience and understanding. ToolsDB should now be fully operational. Please report issues to the #wikimedia-cloud IRC channel.

T216441 is to track further recovery efforts around tables that are not replicated.

> Can you clarify: did you really mean 2019-02-26, or is the estimate actually for tomorrow (2019-02-19)?

I think it was a typo. Either way, we beat the estimate, I think, as long as the problem doesn't come back on the new server. So far, so good.

Thanks to everyone who helped resolve this!

bd808 moved this task from Backlog to ToolsDB on the Data-Services board. Feb 19 2019, 1:08 AM
Hjfocs added a subscriber: Hjfocs. Feb 19 2019, 10:08 AM

> All the ones from s52552 are:
>
> 75697	s52552	xx	s52552__phragile	Prepare	14024	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000
> 75739	s52552	xx	s52552__phragile	Prepare	14001	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000
>
> Hi, thanks for letting us know. The Phragile app is running on its own VPS now, and I'm not sure why we kept the DB on the tools server (it's currently accessed from the VPS instance). I'm trying to move the DB to the VPS so we can hopefully solve that problem and reduce the load.

FYI: I just moved the Phragile database (aka s52552__phragile) to its own instance on Cloud VPS, and the app uses that DB now. When I get the OK from my colleagues I will remove it from ToolsDB.

Removing PAWS as it is no longer relevant to this task. Thanks for all the extra effort put into getting this solved quickly.

Ghuron removed a subscriber: Ghuron. Feb 21 2019, 6:04 PM
Euku removed a subscriber: Euku. Feb 21 2019, 6:43 PM
Bstorm closed this task as Resolved. May 24 2019, 11:43 PM
Bstorm claimed this task.

The only things still open under this are hardware items that are in the process of decommissioning. I think all of our ToolsDB action items from the outage are complete (not to say that there isn't still work to do).