Deployment git server can't supply ORES hosts in parallel
Open, Normal, Public

Description

We have the fetch-hosts concurrency manually set to 3, with 18 hosts to service, because of resource concerns and crashes during deployment. This is unfortunate; ideally we could fetch all 18 in parallel.

Git-lfs might win us back some performance here. Also, should we investigate multicast?

awight created this task. Apr 9 2018, 8:28 PM
Restricted Application added a subscriber: Aklapper. Apr 9 2018, 8:28 PM
Dzahn added a subscriber: Dzahn. Apr 18 2018, 12:17 AM

Which deployment server is this about? Production or deployment-prep or another?

Sorry, this is about production.

RobH triaged this task as Normal priority. May 3 2018, 4:46 PM
RobH added a subscriber: RobH.

As part of SRE clinic duty, I'm reviewing all unassigned, needs-triage tasks in Operations to determine whether any are critical or whether they are normal priority.

This task appears to be normal priority, and I have set it as such. If anyone on this task disagrees, please comment and correct. Anything with a high priority or above typically requires a response ahead of other items, so please make sure you have supporting documentation on why those priorities should be used.

Thanks!

greg added a subscriber: greg. Jul 6 2018, 8:09 PM

I think to do much on this we'll need some performance numbers.

Also, if this isn't 100% addressed by git-lfs reducing the load then I guess it's time to look into fan-out servers for ORES...

Ladsgroup claimed this task. Nov 7 2018, 5:59 PM
Ladsgroup added a subscriber: Ladsgroup.

Given that we have git lfs now, I want to increase the number of parallel requests gradually.

Restricted Application added a project: User-Ladsgroup. Nov 7 2018, 5:59 PM

Change 472230 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Bump wheels to HEAD

https://gerrit.wikimedia.org/r/472230

Change 472230 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Bump wheels to HEAD

https://gerrit.wikimedia.org/r/472230

Mentioned in SAL (#wikimedia-operations) [2018-11-07T21:03:28Z] <ladsgroup@deploy1001> Started deploy [ores/deploy@25dfa4f]: T191842 T197096

Mentioned in SAL (#wikimedia-operations) [2018-11-07T21:20:52Z] <ladsgroup@deploy1001> Finished deploy [ores/deploy@25dfa4f]: T191842 T197096 (duration: 17m 24s)

Deployment time is now down from 22 minutes to 17 minutes. I will increase the number of parallel connections from 5 to 8 on the next try.

Change 472413 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Increase number of parallel connections to 9

https://gerrit.wikimedia.org/r/472413

Mentioned in SAL (#wikimedia-operations) [2018-11-09T14:39:08Z] <ladsgroup@deploy1001> Started deploy [ores/deploy@0728805]: T191842 T209060

Mentioned in SAL (#wikimedia-operations) [2018-11-09T14:48:40Z] <ladsgroup@deploy1001> deploy aborted: T191842 T209060 (duration: 09m 32s)

Change 472413 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Increase number of parallel connections to 9

https://gerrit.wikimedia.org/r/472413

Mentioned in SAL (#wikimedia-operations) [2018-11-09T14:49:21Z] <ladsgroup@deploy1001> Started deploy [ores/deploy@bb39f4b]: T191842 T209060, try II

Mentioned in SAL (#wikimedia-operations) [2018-11-09T15:04:04Z] <ladsgroup@deploy1001> Finished deploy [ores/deploy@bb39f4b]: T191842 T209060, try II (duration: 14m 43s)

With the new setting of 9 parallel connections and about 14 minutes to deploy (down from around half an hour), I think it's good now.
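For anyone wanting rough performance numbers, the two durations logged in SAL can be compared against the 22-minute figure quoted earlier in this task. This is a small sketch only: `duration_seconds` and `percent_reduction` are hypothetical helpers, and the "around half an hour" baseline is left out because it is only approximate.

```python
import re


def duration_seconds(text):
    """Parse a SAL-style duration such as '17m 24s' into seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def percent_reduction(before_s, after_s):
    """Relative time saved, as a percentage of the earlier duration."""
    return 100.0 * (before_s - after_s) / before_s


if __name__ == "__main__":
    baseline = 22 * 60                    # "22 minutes" quoted in the earlier comment
    first = duration_seconds("17m 24s")   # 2018-11-07 deploy -> 1044 seconds
    second = duration_seconds("14m 43s")  # 2018-11-09 deploy, try II -> 883 seconds
    print(f"first deploy:  {percent_reduction(baseline, first):.1f}% faster")
    print(f"second deploy: {percent_reduction(baseline, second):.1f}% faster")
```

These are single runs, so the percentages are indicative rather than a benchmark.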