
ToolsDB overload and cleanup
Closed, Resolved (Public)


Event Timeline


Mentioned in SAL (#wikimedia-operations) [2019-02-15T06:40:58Z] <marostegui> Stop puppet on labsdb1005 to leave "max_user_connections" on my.cnf - T216170 T216208
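For context, max_user_connections is the standard MariaDB per-user connection cap, normally managed through my.cnf (and, on these hosts, Puppet). A minimal sketch of how it can be inspected and adjusted at runtime; the value below is purely illustrative and is not the setting used on labsdb1005:

-- Check the current per-user connection cap.
SHOW GLOBAL VARIABLES LIKE 'max_user_connections';
-- Illustrative value only; requires the SUPER privilege, and on
-- Puppet-managed hosts the persistent value lives in my.cnf.
SET GLOBAL max_user_connections = 64;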

All the ones from s52552 are:

Id	User	Host	db	Command	Time	State	Info	Progress
75697	s52552	xx	s52552__phragile	Prepare	14024	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000
75739	s52552	xx	s52552__phragile	Prepare	14001	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000

Hi, thanks for letting us know. The Phragile app is running on its own VPS now and I'm not sure why we kept the DB on the Tools server (it's currently accessed from the VPS instance). I'm trying to move the DB to the VPS so we can hopefully solve that problem and reduce the load.
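As an aside, the per-user picture shown above can be pulled from information_schema.processlist; a minimal sketch, assuming the PROCESS privilege on the ToolsDB host (the state filter and limit are illustrative):

-- Open connections per user, busiest first.
SELECT user, COUNT(*) AS connections
FROM information_schema.processlist
GROUP BY user
ORDER BY connections DESC
LIMIT 20;

-- Long-running statements stuck in "Opening tables", like the s52552 queries above.
SELECT id, user, db, time, state, info
FROM information_schema.processlist
WHERE state = 'Opening tables'
ORDER BY time DESC;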

The problem has returned this morning. Rising user connections are a symptom of all tables becoming unusable, not the cause. WMCS is working to stand up a new server to move to as quickly as possible.

Nonetheless, it is exposing some interesting issues, and some best practices that have been missed in places!

Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

> Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

Please keep your comments factual and leave out the attitude. Just subscribe to the cloud-announce list (https://lists.wikimedia.org/pipermail/cloud-announce/2019-February/thread.html) so you know what is happening.

No, I'll not be doing that, any more than you just have. Whilst a serious outage is being treated merely as a curiosity, it is well worth reminding WMF employees that outages like this are far from acceptable.

> Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

I can give a guesstimate. Given the complexity of some of the operations we are doing (especially to prevent serious data loss), services probably won't be fully recovered until at least Tuesday next week (2019-02-26).

> No, I'll not be doing that, any more than you just have. Whilst a serious outage is being treated merely as a curiosity, it is well worth reminding WMF employees that outages like this are far from acceptable.

Wikimedia Cloud Services is not an outage-free zone. Perhaps that should change, but then you really should be advocating for a lot more resources for it. With its current resources the team does an amazing job serving the communities of developers and users.

On the original "no comment" remark: it really shows you are not informing yourself in the least. There have been multiple announcements on the mailing lists, repeated responses in #wikimedia-cloud on Freenode, several tasks on Phabricator, and a few incident reports on Wikitech - all of them the preferred channels for communication and documentation for Wikimedia Cloud Services matters. You may have missed those, as most users probably have, but that does not give you leave to assume they did not happen or to attack the people and services of the Cloud Services team.

Quick report:
The team made progress on getting the new server up yesterday and overnight. If we manage to get things live again before Tuesday, we'll get the word out! If not, I think things are moving along well toward getting ToolsDB back "working" by then. Thanks for your patience, everyone!

> Day 2 of no Quickstatements, no Listeria, no ETA for a resumption of service ... no comment from WMF on why what is a critical service - if we're taking this whole wikidata business at all seriously - is fritzed.

If I am not mistaken, it's a requirement to at least be subscribed to -announce, if not the primary list itself. You're complaining about a couple of days' (maybe a week total) outage. I remember the days when the database had over a year of lag. Things used to break regularly and that was the norm. As users of the free hosting and services, we understood that issues arise. If you cannot deal with an issue once a year or so nowadays, you might look at investing in moving your projects to a dedicated hosting service with an SLA and a guaranteed 100% uptime clause. Otherwise, understand that there are communication channels in place and that issues come up.

Also, you know, you're running outside of production.

Change 491004 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: fix up the config for the new server

https://gerrit.wikimedia.org/r/491004

Change 491004 merged by Bstorm:
[operations/puppet@production] toolsdb: fix up the config for the new server

https://gerrit.wikimedia.org/r/491004

> I can give a guesstimate. Given the complexity of some of the operations we are doing (especially to prevent serious data loss), services probably won't be fully recovered until at least Tuesday next week (2019-02-26).

Can you clarify - did you really mean 2019-02-26, or is the estimate actually for tomorrow (2019-02-19)? Also (if it is the extended period of an additional week), it would be nice to have a brief summary of the steps being taken that explains why it would take so long. Are you replacing hardware? Copying and verifying data? Working on an improved configuration of some sort?

Thanks!

Mentioned in SAL (#wikimedia-cloud) [2019-02-18T18:50:21Z] <chicocvenancio> moving paws back to toolsdb T216208

Repeating from cloud-announce: The change-over was successful. Thank you for your patience and understanding. ToolsDB should now be fully operational. Please report issues to the #wikimedia-cloud IRC channel.

T216441 is to track further recovery efforts around tables that are not replicated.

> Can you clarify - did you really mean 2019-02-26, or is the estimate actually for tomorrow (2019-02-19)?

I think it was a typo. Either way, I think we beat the estimate, as long as the problem doesn't come back on the new server. So far, so good.

Thanks to everyone who helped resolve this!

> All the ones from s52552 are:
>
> 75697	s52552	xx	s52552__phragile	Prepare	14024	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000
> 75739	s52552	xx	s52552__phragile	Prepare	14001	Opening tables	select * from `sprints` where `sprints`.`id` = ? limit 1	0.000
>
> Hi, thanks for letting us know. The Phragile app is running on its own VPS now and I'm not sure why we kept the DB on the Tools server (it's currently accessed from the VPS instance). I'm trying to move the DB to the VPS so we can hopefully solve that problem and reduce the load.

FYI: I just moved the Phragile database, aka s52552__phragile, to its own instance on CloudVPS, and the app uses that DB now. When I get the OK from my colleagues I will remove it from ToolsDB.

Removing PAWS as it is no longer relevant to this task. Thanks for all the extra effort put into getting this solved quickly.

Bstorm claimed this task.

The only things still open under this are hardware items that are in the process of being decommissioned. I think all of our ToolsDB action items from the outage are completed (not to say that there isn't still work to do).