Increase mwdebugXXXX hosts CPU
Closed, Resolved (Public)

Assigned To: akosiaris
Authored By: greg, Jan 4 2019, 5:32 PM

Description

See the parent task (T203625: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild) for reasoning. tl;dr: these hosts are reliably the last to finish the scap-cdb-rebuild step (CPU- and/or memory-intensive; I haven't profiled it), which in turn causes timeouts during our health checks.

Related Objects

Status	Subtype	Assigned	Task
Resolved		None	T203625 mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild
Duplicate		None	T203664 scap timeout checking index.php/api.php mwdebug1001 / mwdebug1002
Duplicate	PRODUCTION ERROR	None	T215368 First request after a MediaWiki sync times out on mwdebug
Resolved		akosiaris	T212955 Increase mwdebugXXXX hosts CPU

Event Timeline

greg created this task. Jan 4 2019, 5:32 PM

hashar subscribed. Jan 4 2019, 6:14 PM

This comment was removed by hashar.

IIRC the cdb files are generated by rebuildLocalisationCache.php, which is CPU bound and runs with up to 30 parallel tasks.

I previously found it was bound to a single CPU under HHVM because hhvm.stats.enable_hot_profiler was enabled and enforced CPU affinity (hence all 30 threads ran on the same CPU); see T191921#4557854. We then moved that to php7.0, since we want to migrate off HHVM anyway.

To speed up the cdb rebuild, we thus need more cores available on the mwdebug hosts. Currently /proc/cpuinfo reports a single core. The exact number of cores is to be determined; I guess we can use at least 4?
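For anyone wanting to double-check a host, here is a minimal Python sketch of the two checks discussed above: counting the processors listed in /proc/cpuinfo, and counting the CPUs the current process is actually allowed to run on (which is what an affinity restriction would reduce). The helper names (`count_processors`, `usable_cpus`, `effective_parallelism`) are made up for illustration and are not part of scap or MediaWiki; the 30-task figure is the one quoted above.

```python
import os


def count_processors(cpuinfo_text: str) -> int:
    """Count 'processor' entries in /proc/cpuinfo-style text (x86 Linux layout)."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


def usable_cpus() -> int:
    """CPUs this process may run on; honours CPU affinity where the OS exposes it."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def effective_parallelism(requested_tasks: int, cpus: int) -> int:
    """Upper bound on tasks that can run at the same time for CPU-bound work."""
    return max(1, min(requested_tasks, cpus))


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as fh:
            print("processors in /proc/cpuinfo:", count_processors(fh.read()))
    except OSError:
        print("/proc/cpuinfo not available on this system")
    cpus = usable_cpus()
    print("CPUs usable by this process:", cpus)
    print("of 30 requested tasks, running concurrently:", effective_parallelism(30, cpus))
```

On a single-core VM, or with affinity pinned to one CPU, the last figure is 1 no matter how many parallel tasks are requested, which matches the slowness described here.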

greg renamed this task from Increase mwdebugXXXX hosts CPU and memory to Increase mwdebugXXXX hosts CPU and memory(?). Jan 4 2019, 6:21 PM

herron triaged this task as High priority. Jan 7 2019, 3:30 PM

herron added a project: vm-requests.

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 10:59 PM

Krinkle added a parent task: T203664: scap timeout checking index.php/api.php mwdebug1001 / mwdebug1002. Feb 6 2019, 2:52 AM

Krinkle added a parent task: T215368: First request after a MediaWiki sync times out on mwdebug. Feb 6 2019, 2:57 AM

I think @fsero / @akosiaris should be able to bump the number of CPUs on those Ganeti instances :-] We can try with four?

Mentioned in SAL (#wikimedia-operations) [2019-02-07T09:41:49Z] <akosiaris> reboot mwdebug1001, mwdebug1002, mwdebug2001, mwdebug2002 for VCPU upgrade. T212955

akosiaris renamed this task from Increase mwdebugXXXX hosts CPU and memory(?) to Increase mwdebugXXXX hosts CPU. Feb 7 2019, 9:45 AM

I've removed the memory part because https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=mwdebug1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&from=now-7d&to=now shows that mwdebug1002 is never pressed for more memory. I've also bumped the vCPU count to 4. I'll resolve this for now; if we need more resources, feel free to reopen.

hashar mentioned this in T215368: First request after a MediaWiki sync times out on mwdebug. Feb 7 2019, 9:51 AM

hashar mentioned this in T203625: mwdebug1001 and mwdebug1002 are reliably the last two hosts to finish scap-cdb-rebuild.

hashar mentioned this in T224026: mwdebug2002.codfw.wmnet and mwdebug1002.eqiad.wmnet need more vCPU: scap-cdb-rebuild is too slow. May 21 2019, 2:51 PM
