
Experiment with InnoDB buffer pool size on clouddb1019.eqiad.wmnet
Closed, Declined · Public

Description

Via Alertmanager:

WARN Memory 95% used. Largest process: mysqld (1866555) = 73.8%

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=clouddb1019&service=MariaDB+memory

This is a task to reduce the InnoDB buffer pool size on clouddb1019, both to quiet the memory warnings and to experiment with whether doing so causes any material query degradation. If the change does cause material query degradation, or if the alerts resurface later, it would need to be reverted, and alternative root cause analysis (e.g., for memory leaks) and exploration of mitigations would need to be performed instead.

dr0ptp4kt@clouddb1019:~$ top -b -n 1 -o +%MEM | head -10 | tail -4
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1866555 mysql     20   0  377.0g 371.1g  13984 S  18.8  73.8  66036:54 mysqld
1866396 mysql     20   0  107.5g 102.6g  13836 S   0.0  20.4   6757:55 mysqld
 502617 root      20   0  171208  79248  78068 S   0.0   0.0  89:38.21 systemd-journal

For now the host seems stable, but this is being filed in anticipation of possible treatment next week.
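
For reference, a minimal sketch of how the pool size could be checked and lowered at runtime on one of the instances (the s4 socket path and the 245G figure are assumptions for illustration; innodb_buffer_pool_size is dynamic in MariaDB 10.2+, but the durable setting lives in the hieradata files linked in the comments below, so a runtime change alone would not survive a restart):

dr0ptp4kt@clouddb1019:~$ sudo mysql -S /run/mysqld/mysqld.s4.sock   # socket path is an assumption
MariaDB [(none)]> SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gib;
MariaDB [(none)]> -- Illustrative target only, not a decided value; the resize happens in the background in chunks.
MariaDB [(none)]> SET GLOBAL innodb_buffer_pool_size = 245 * 1024 * 1024 * 1024;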

Event Timeline

Whatever is decided, I'd prefer that we don't create configuration snowflakes. All hosts serving the same type of queries (web or analytics) should have the same configuration; otherwise we are going to run into problems in the long term.
Ideally we shouldn't even have differences between the web and analytics roles, but I guess we can make an exception there if we are really aware of the tradeoffs that might come with it.

Makes sense. Links below for the current mappings and trends, with per-host totals recapped after the listings. In broad strokes, it looks like a total of 343G is just a bit too high in terms of the memory pressure that leads to warnings and alerts (and 395G and 403G probably more so), while a total of 280G is lower than necessary. A target of 315G-320G total would probably be in the sweet spot. Hoping to discuss a bit more next week or the week after.

https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/clouddb1013.yaml
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1013&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/clouddb1017.yaml
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1017&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m

s1: 243G
s3: 100G

https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/clouddb1014.yaml
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1014&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/clouddb1018.yaml
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1018&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m

s2: 95G
s7: 185G

https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/clouddb1015.yaml
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1015&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/clouddb1019.yaml
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1019&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m

s4: 313G
s6: 90G

https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/clouddb1016.yaml
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1016&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/clouddb1020.yaml
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1020&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m

s8: 210G
s5: 185G

https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/hosts/clouddb1021.yaml
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1021&var-datasource=thanos&var-cluster=mysql&from=now-30d&to=now&refresh=5m

s1: 70G
s2: 40G
s3: 40G
s4: 70G
s5: 40G
s6: 30G
s7: 50G
s8: 70G
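
Recapping the per-host totals from the listings above (assuming, per the grouping of the links, that each web/analytics pair shares the same per-section sizes):

clouddb1013 / clouddb1017 (s1 + s3): 243G + 100G = 343G
clouddb1014 / clouddb1018 (s2 + s7):  95G + 185G = 280G
clouddb1015 / clouddb1019 (s4 + s6): 313G +  90G = 403G
clouddb1016 / clouddb1020 (s5 + s8): 185G + 210G = 395G
clouddb1021 (all sections): 70G + 40G + 40G + 70G + 40G + 30G + 50G + 70G = 410G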

Looping in @Andrew. What I'm thinking is that we could modify clouddb1015 and clouddb1019 to a grand total of 315G (split proportionally again, as these are multi-instance hosts), then check back on the graphs in a month and see if memory has stabilized (I can also set a weekly reminder to check https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddb1019&var-datasource=thanos&var-cluster=mysql&from=now-90d&to=now&refresh=5m, just to see if usage keeps climbing, as it has seemed to between the last two restarts). A sketch of the proportional split is below.

There's a nonzero chance some other servers will start exhibiting memory warnings and criticals in the intervening period, but if we really want to see whether things stabilize, I think we'd need to give the experiment enough time to run and then determine whether we're comfortable applying the change to the other nodes.
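
A minimal sketch of the proportional split for a 315G target on the s4/s6 pair, using the current 313G + 90G = 403G as the baseline (the awk one-liner is just for illustration):

$ awk 'BEGIN { c = 313 + 90; t = 315; printf "s4: %.0fG  s6: %.0fG\n", 313*t/c, 90*t/c }'
s4: 245G  s6: 70G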

@Marostegui @Andrew I was thinking of scheduling a slot for this so we can do it synchronously in case of issues, maybe even recording the steps for the wiki documentation. But it looks like there isn't great mutual availability soon. I do see that @Ladsgroup could maybe be available 9-11 October, or we could wait until the week after, for example 17 October (by then we may be inching up on the memory again, though!). I also see mutual availability on 27 September with Andrew, Amir, and me, but it's later in the day for Amir, and I worry about people needing to stay online even later in case of something unexpected, however unlikely that seems. Preference on a window for this?

I'd suggest dropping this everywhere rather than on just two or three hosts, as I mentioned before. Otherwise it will be a bit of a mess with Puppet.
Right now clouddb1015 is the one in a warning state in Icinga.

I am going to decline this for now, as there doesn't seem to be a clear path forward and ownership isn't clear. Please reopen if you feel this still needs to be addressed.