In the time I have been working here, I have seen certain very specific hosts suffering either swapping issues or OOMs. In most cases these are not due to bad database configuration, but to over-committing resources in edge cases (see T107072). Those edge cases are not related to how "busy" the server is (enwiki does not have these problems), but probably to one or several of these issues:
- Mixing more than one storage engine, so that MyISAM, InnoDB and TokuDB compete for resources
- Unusual workloads (UPDATEs/ALTERs on a large number of tables during software deployments, reboots, horrible long-running queries written with no understanding of the underlying hardware on non-core servers)
- Too many objects, which require more OS-controlled memory (buffers). This has probably grown recently due to thousands of new objects being created in the default shard
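To make the over-commit concrete, here is a minimal back-of-the-envelope sketch of the worst case when several engine caches and per-connection buffers all fill up at once. Every number below is a hypothetical placeholder, not our actual configuration:

```python
# Rough worst-case memory commitment of a multi-engine MySQL host
# vs. its physical RAM. All sizes are made-up example values.

GIB = 1024 ** 3

ram = 64 * GIB                   # physical memory on the host
innodb_buffer_pool = 48 * GIB    # innodb_buffer_pool_size
tokudb_cache = 8 * GIB           # tokudb_cache_size
key_buffer = 4 * GIB             # key_buffer_size (MyISAM indexes)
max_connections = 500
per_connection = 16 * 1024 ** 2  # sort/join/read buffers per thread, rough

# Worst case: every engine cache full AND every connection
# allocating its session buffers at the same time.
worst_case = (innodb_buffer_pool + tokudb_cache + key_buffer
              + max_connections * per_connection)

overcommit = worst_case - ram
print(f"worst case: {worst_case / GIB:.1f} GiB, "
      f"overcommit: {overcommit / GIB:.1f} GiB")
```

With these example numbers the host is fine on an average day but a few GiB over-committed on the edge case, which is exactly when the kernel starts swapping or the OOM killer fires.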
While better hardware may be coming for the old production and labs db hosts, I think we should check this in the meantime.
Proposal:
- Analyze past OOMs and swap usage globally, especially on the s3 shard, the research and dbstore hosts, and the labsdb hosts
- Reduce the InnoDB buffer pool (s3), or find a better balance between it and the TokuDB, MyISAM and Aria caches (labs, dbstore) and OS buffering (all)
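As a sketch of what the rebalancing could look like on a labs/dbstore-style multi-engine host, here is a hypothetical my.cnf fragment. The variable names are the standard MariaDB/TokuDB ones; the sizes are example values only, to be derived from the actual analysis above:

```ini
# Hypothetical my.cnf fragment -- example values, not a recommendation.
[mysqld]
# Shrink InnoDB so the sum of all engine caches leaves OS headroom:
innodb_buffer_pool_size    = 32G
# Explicit caps for the other engines instead of relying on defaults:
tokudb_cache_size          = 12G
key_buffer_size            = 2G   # MyISAM index cache
aria_pagecache_buffer_size = 1G
# The remainder stays free for the OS page cache, per-connection
# buffers and per-object memory (open table caches, dictionaries).
```

Note that on older MariaDB versions the buffer pool size is not dynamic, so a change like this implies a restart, which is something to schedule per shard.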
Pros
- If it works, better throughput and fewer performance-drop spikes
- Potentially faster buffer-related tasks (binlog handling, replication lag, etc.)
- More reliability
Cons
- Worse latency, and maybe worse throughput, due to less memory available to InnoDB/MySQL
Alternatives
- Just wait for better hardware
- Reshard s3 or create another default shard (this does not work for labs/dbstore)