
ToolsDB overload and cleanup
Closed, Resolved (Public)


Event Timeline


Mentioned in SAL (#wikimedia-operations) [2019-02-15T06:40:58Z] <marostegui> Stop puppet on labsdb1005 to leave "max_user_connections" on my.cnf - T216170 T216208
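For context, max_user_connections is the standard MariaDB per-user connection cap, normally managed through my.cnf (and, on these hosts, Puppet). A minimal sketch of how it can be inspected and adjusted at runtime; the value below is purely illustrative and is not the setting used on labsdb1005:

-- Check the current per-user connection cap.
SHOW GLOBAL VARIABLES LIKE 'max_user_connections';
-- Illustrative value only; requires the SUPER privilege, and on
-- Puppet-managed hosts the persistent value lives in my.cnf.
SET GLOBAL max_user_connections = 64;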

All the ones from s52552 are:

Id	User	Host	db	Command	Time	State	Info	Progress
75697	s52552	xx	s52552__phragile	Prepare	14024	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000
75739	s52552	xx	s52552__phragile	Prepare	14001	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000

Hi, thanks for letting us know. The Phragile app is running on its own VPS now and I'm not sure why we kept the DB on the Tools server (it's currently accessed from the VPS instance). I'm trying to move the DB to the VPS so we can hopefully solve that problem and reduce the load.
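As an aside, the per-user picture shown above can be pulled from information_schema.processlist; a minimal sketch, assuming the PROCESS privilege on the ToolsDB host (the state filter and limit are illustrative):

-- Open connections per user, busiest first.
SELECT user, COUNT(*) AS connections
FROM information_schema.processlist
GROUP BY user
ORDER BY connections DESC
LIMIT 20;

-- Long-running statements stuck in "Opening tables", like the s52552 queries above.
SELECT id, user, db, time, state, info
FROM information_schema.processlist
WHERE state = 'Opening tables'
ORDER BY time DESC;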

The problem has returned this morning. Rising user connections are a symptom of all tables becoming unusable, not the cause. WMCS is working to stand up a new server to move to as quickly as possible.

Nonetheless, it is exposing some interesting issues, and some best practices that have been missed in places!

Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

> Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

Please keep your comments factual and leave out the attitude. Just subscribe to the cloud-announce list (https://lists.wikimedia.org/pipermail/cloud-announce/2019-February/thread.html) so you know what is happening.

No, I'll not be doing that, any more than you just have. Whilst a serious outage is being treated merely as a curiosity, it is well worth reminding WMF employees that outages like this are far from acceptable.

> Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

I can give a guesstimate. Given the complexity of some of the operations we are doing (especially to prevent serious data loss), services probably won't be fully recovered until at least Tuesday next week (2019-02-26).

> No, I'll not be doing that, any more than you just have. Whilst a serious outage is being treated merely as a curiosity, it is well worth reminding WMF employees that outages like this are far from acceptable.

Wikimedia Cloud Services is not an outage-free zone. Perhaps that should change, but then you really should be advocating for a lot more resources for it. With its current resources the team does an amazing job serving the communities of developers and users.

On the original "no comment" remark: it really shows you are not informing yourself in the least. There have been multiple announcements on the mailing lists, repeated responses in #wikimedia-cloud on Freenode, several tasks on Phabricator, and a few incident reports on Wikitech - all of them the preferred channels for communication and documentation for Wikimedia Cloud Services matters. You may have missed those, as most users probably have, but that does not give you leave to assume they did not happen or to attack the people and services of the Cloud Services team.

Quick report:
The team made progress on getting the new server up yesterday and overnight. If we manage to get things live again before Tuesday, we'll get the word out! If not, I think things are moving along well toward getting ToolsDB back "working" by then. Thanks for your patience, everyone!

> Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

If I am not mistaken, it's a requirement to at least be subscribed to -announce, if not the primary list itself. You're complaining about a couple of days' (maybe a week total) outage. I remember the days when the database had over a year of lag. Things used to break regularly and that was the norm. As users of the free hosting and services, we understood that issues arise. If you cannot deal with an issue once a year or so nowadays, you might look at investing in moving your projects to a dedicated hosting service with an SLA and a guaranteed 100% uptime clause. Otherwise, understand that there are communication channels in place and that issues come up.

Also, you know, you're running outside of production.

Change 491004 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: fix up the config for the new server

https://gerrit.wikimedia.org/r/491004

Change 491004 merged by Bstorm:
[operations/puppet@production] toolsdb: fix up the config for the new server

https://gerrit.wikimedia.org/r/491004

> I can give a guesstimate. Given the complexity of some of the operations we are doing (especially to prevent serious data loss), services probably won't be fully recovered until at least Tuesday next week (2019-02-26).

Can you clarify - did you really mean 2019-02-26, or is the estimate actually for tomorrow (2019-02-19)? Also (if it is the extended period of an additional week), it would be nice to have a brief summary of the steps being taken that explains why it would take so long. Are you replacing hardware? Copying and verifying data? Working on an improved configuration of some sort?

Thanks!

Mentioned in SAL (#wikimedia-cloud) [2019-02-18T18:50:21Z] <chicocvenancio> moving paws back to toolsdb T216208

Repeating from cloud-announce: The change-over was successful. Thank you for your patience and understanding. ToolsDB should now be fully operational. Please report issues to the #wikimedia-cloud IRC channel.

T216441 is to track further recovery efforts around tables that are not replicated.

> Can you clarify - did you really mean 2019-02-26, or is the estimate actually for tomorrow (2019-02-19)?

I think it was a typo. Either way, I think we beat the estimate, as long as the problem doesn't come back on the new server. So far, so good.

Thanks to everyone who helped resolve this!

> All the ones from s52552 are:
>
> 75697	s52552	xx	s52552__phragile	Prepare	14024	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000
> 75739	s52552	xx	s52552__phragile	Prepare	14001	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000
>
> Hi, thanks for letting us know. The Phragile app is running on its own VPS now and I'm not sure why we kept the DB on the Tools server (it's currently accessed from the VPS instance). I'm trying to move the DB to the VPS so we can hopefully solve that problem and reduce the load.

FYI: I just moved the Phragile database, aka s52552__phragile, to its own instance on CloudVPS, and the app uses that DB now. When I get the OK from my colleagues I will remove it from ToolsDB.

Removing PAWS as it is no longer relevant to this task. Thanks for all the extra effort put into getting this solved quickly.

Bstorm claimed this task.

The only things still open under this are hardware items that are in the process of being decommissioned. I think all of our ToolsDB action items from the outage are completed (not to say that there isn't still work to do).