Description
See the parent task (T203625: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild) for reasoning. tl;dr: these hosts are reliably the last to finish the scap-cdb-rebuild step (CPU- and/or memory-intensive; I haven't profiled it), which in turn is causing timeouts during our health checks.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T203625 mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild |
| Duplicate | | None | T203664 scap timeout checking index.php/api.php mwdebug1001 / mwdebug1002 |
| Duplicate | PRODUCTION ERROR | None | T215368 First request after a MediaWiki sync times out on mwdebug |
| Resolved | | akosiaris | T212955 Increase mwdebugXXXX hosts CPU |
Event Timeline
IIRC the cdb files are generated by rebuildLocalisationCache.php, which is CPU-bound and runs with up to 30 parallel tasks.
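For context, the rebuild step essentially boils down to an invocation like this (a minimal sketch, assuming the usual mwscript wrapper; the exact wiki, output directory, and flags scap passes may differ):

```sh
# Sketch of the localisation cache rebuild behind scap-cdb-rebuild.
# The wiki, output directory, and thread count here are illustrative
# guesses, not the real scap invocation.
mwscript rebuildLocalisationCache.php --wiki=enwiki \
    --outdir=/tmp/l10n-cache --threads=30
```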
I previously found it was bound to a single CPU under HHVM due to hhvm.stats.enable_hot_profiler being enabled and enforcing CPU affinity (hence all 30 threads ran on the same CPU); see T191921#4557854. We then moved that to php7.0, since we want to migrate off HHVM anyway.
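A quick way to verify whether the rebuild threads are actually spread across CPUs is to inspect the affinity of the running processes (a sketch; the pgrep pattern is an assumption about the process name):

```sh
# Print the CPU affinity of each running rebuild process; under the
# HHVM issue above they would all show a single CPU. The pgrep
# pattern is a guess at the process name.
for pid in $(pgrep -f rebuildLocalisationCache); do
    taskset -cp "$pid"
done
```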
To speed up the cdb rebuild, we thus need more cores available on the mwdebug hosts. Currently /proc/cpuinfo reports a single core. The exact number of cores is to be determined; I guess we can use at least 4?
I think @fsero / @akosiaris should be able to bump the number of CPUs on those Ganeti instances :-] Shall we try with four?
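For reference, on the Ganeti master that would be roughly the following (a sketch; the instance FQDN is an assumption, and the backend change only takes effect after a reboot):

```sh
# Raise the vCPU backend parameter, then reboot so the instance picks
# it up. Repeat for each mwdebug host (FQDNs are assumptions).
gnt-instance modify -B vcpus=4 mwdebug1001.eqiad.wmnet
gnt-instance reboot mwdebug1001.eqiad.wmnet
```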
Mentioned in SAL (#wikimedia-operations) [2019-02-07T09:41:49Z] <akosiaris> reboot mwdebug1001, mwdebug1002, mwdebug2001, mwdebug2002 for VCPU upgrade. T212955
I've removed the memory part because https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=mwdebug1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&from=now-7d&to=now shows that mwdebug1002 is never pressed for more memory. I've also bumped the vCPU count to 4. I'll resolve this for now; if we need more resources, feel free to reopen.
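After the reboot, confirming the new count on a guest is straightforward:

```sh
# Run on the guest after the reboot; both should now report 4.
nproc
grep -c ^processor /proc/cpuinfo
```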