
Increase mwdebugXXXX hosts CPU
Closed, Resolved (Public)

Description

See the parent task (T203625: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild) for reasoning. tl;dr: these hosts are reliably the last to finish the scap-cdb-rebuild step (CPU and/or memory intensive; I haven't profiled it), which in turn causes timeouts during our health checks.

Event Timeline

greg created this task. Jan 4 2019, 5:32 PM
hashar added a subscriber: hashar. Jan 4 2019, 6:14 PM
This comment was removed by hashar.
hashar added a comment. Edited Jan 4 2019, 6:19 PM

IIRC the cdb files are generated by rebuildLocalisationCache.php, which is CPU-bound and runs with up to 30 parallel tasks.

I previously found it was bound to a single CPU under HHVM because hhvm.stats.enable_hot_profiler was enabled and enforced CPU affinity (hence all 30 threads ran on the same CPU); see T191921#4557854. We then moved that to php7.0 since we want to migrate away from HHVM anyway.
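For anyone wanting to check for this kind of pinning, here is a minimal Python sketch that compares the CPUs a process is allowed to run on with the CPUs the host reports. The helper names `usable_cpus` and `is_pinned_to_one_cpu` are hypothetical and not part of scap or MediaWiki. It inspects the calling process only, so it would have to run inside (or be adapted to target the PID of) the process under suspicion.

```python
import os


def usable_cpus() -> int:
    """Return how many CPUs the current process is allowed to run on."""
    # os.sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: no affinity information, report the host's CPU count.
    return os.cpu_count() or 1


def is_pinned_to_one_cpu() -> bool:
    """True when the host has several CPUs but this process may use only one."""
    return usable_cpus() == 1 and (os.cpu_count() or 1) > 1


if __name__ == "__main__":
    print("CPUs visible to the host:", os.cpu_count())
    print("CPUs usable by this process:", usable_cpus())
    print("Pinned to a single CPU:", is_pinned_to_one_cpu())
```

If the usable count is 1 while the host reports more, parallel workers inherited from that process would all compete for the same CPU.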

To speed up the cdb rebuild, we thus need more cores available on the mwdebug hosts. Currently /proc/cpuinfo reports a single core. The exact number of cores is to be determined; I guess we can use at least 4?
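As a rough illustration of why the core count matters here, the sketch below counts `processor` entries in /proc/cpuinfo-style text and estimates how many sequential rounds 30 parallel tasks need on a given number of cores. The function names are hypothetical, the parsing assumes the usual x86 Linux layout (`processor : N` lines), and the estimate assumes equally sized, purely CPU-bound tasks, so it is an idealised lower bound rather than a measurement.

```python
import math


def count_processors(cpuinfo_text: str) -> int:
    """Count logical CPUs by counting 'processor' entries in cpuinfo text."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


def ideal_rounds(tasks: int, cores: int) -> int:
    """Sequential rounds needed if each core runs one equal-sized task at a time."""
    return math.ceil(tasks / cores)


if __name__ == "__main__":
    # Assumes a Linux host where /proc/cpuinfo exists.
    with open("/proc/cpuinfo") as f:
        cores = count_processors(f.read())
    print("cores:", cores)
    print("rounds for 30 tasks:", ideal_rounds(30, cores))
```

Under those assumptions, 30 tasks take 30 rounds on a single core and 8 rounds on four cores.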

greg renamed this task from Increase mwdebugXXXX hosts CPU and memory to Increase mwdebugXXXX hosts CPU and memory(?). Jan 4 2019, 6:21 PM
herron triaged this task as High priority. Jan 7 2019, 3:30 PM
herron added a project: vm-requests.

I think @fsero / @akosiaris should be able to bump the number of CPUs on those Ganeti instances :-] We can try with four?

Mentioned in SAL (#wikimedia-operations) [2019-02-07T09:41:49Z] <akosiaris> reboot mwdebug1001, mwdebug1002, mwdebug2001, mwdebug2002 for VCPU upgrade. T212955

akosiaris renamed this task from Increase mwdebugXXXX hosts CPU and memory(?) to Increase mwdebugXXXX hosts CPU. Feb 7 2019, 9:45 AM
akosiaris closed this task as Resolved. Feb 7 2019, 9:48 AM
akosiaris claimed this task.

I've removed the memory part because https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=mwdebug1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&from=now-7d&to=now shows that mwdebug1002 is never pressed for more memory. I've also bumped the vCPU count to 4. I'll resolve this for now; if we need more resources, feel free to reopen.