Investigate deployment concurrency limitations for ORES
Open, Low, Public

Description

While migrating the ORES services to our dedicated cluster, akosiaris discovered that the git servers can't handle a fully parallel deployment to 18 servers. As a workaround, he has limited concurrency so that only 3 nodes fetch at a time.
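For illustration only, the kind of cap described above can be sketched with a bounded worker pool. This is not scap's implementation: `deploy`, `FakeFetcher`, and the `oresNN` host names are hypothetical stand-ins, and the "fetch" is simulated with a short sleep rather than a real `git fetch`.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def deploy(hosts, fetch, max_concurrent=3):
    """Call fetch(host) for every host, with at most max_concurrent in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map keeps the input order and waits for every call to finish.
        return list(pool.map(fetch, hosts))


class FakeFetcher:
    """Hypothetical stand-in for a remote fetch; records peak simultaneous calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, host):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # pretend to transfer data
        with self._lock:
            self.active -= 1
        return host


# Assumed host names, 18 of them to match the cluster size in the task.
hosts = [f"ores{n:02d}" for n in range(1, 19)]
fetcher = FakeFetcher()
done = deploy(hosts, fetcher, max_concurrent=3)
```

With `max_concurrent=3`, the simulated server never sees more than three fetches at once, at the cost of the deployment taking roughly six rounds instead of one.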

We want to understand where this limitation comes from. Usually, our deployments will be small changes (10 kB), and only occasionally will we update all of our models (500 MB). Maybe we can keep full parallelism for small deployments and use some kind of command-line flag when deploying data-churning changes? Or is the limitation related to absolute repository size, so that it bites us even when making small changes?

Event Timeline

awight created this task. Feb 26 2018, 5:19 PM
Restricted Application added a subscriber: Aklapper. Feb 26 2018, 5:19 PM
greg edited projects, added Scap; removed Release-Engineering-Team. Feb 26 2018, 7:28 PM
greg added a subscriber: greg.

Using a fan out method is how this is handled normally.

awight added a comment. Edited Feb 26 2018, 7:33 PM

> Using a fan out method is how this is handled normally.

@greg Sorry in advance if I'm misunderstanding. I think that's correct, that scap asks all 18 machines to "git fetch" simultaneously, but that's part of why we have a problem with parallelism. If all 18 machines are allowed to execute the git commands simultaneously, we get 18 timeouts.

This is strange because scap has lots of optimizations built in, which use local git clones to reduce the amount of bandwidth requested from the git server. My uneducated guess is that the server still has to do processing related to the absolute size of our repos... so can only do so much of that concurrently.

greg added a comment. Mar 3 2018, 1:02 AM

> > Using a fan out method is how this is handled normally.
>
> @greg Sorry in advance if I'm misunderstanding. I think that's correct, that scap asks all 18 machines to "git fetch" simultaneously, but that's part of why we have a problem with parallelism. If all 18 machines are allowed to execute the git commands simultaneously, we get 18 timeouts.

No sorry, I was too quick: fan out as in we have a manually crafted list of mwXXX hosts that are fanout nodes (one per rack, basically); they serve the fetch to the servers in their rack, while they fetch from tin.
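A minimal sketch of that fan-out idea, under stated assumptions: the rack layout and `mw-*` host names are invented, `fanout_plan` is a hypothetical helper rather than anything in scap, and the first host listed for each rack is simply treated as its fanout node (the real list is described above as manually crafted).

```python
def fanout_plan(racks, origin="tin"):
    """Return two stages of (host, source) pairs for a two-level fan-out.

    Stage 1: one fanout node per rack fetches from the origin.
    Stage 2: the remaining hosts fetch from their own rack's fanout node.
    """
    stage1, stage2 = [], []
    for rack in sorted(racks):
        hosts = racks[rack]
        if not hosts:
            continue
        # Assumption: the first listed host acts as the rack's fanout node.
        relay, rest = hosts[0], hosts[1:]
        stage1.append((relay, origin))
        stage2.extend((host, relay) for host in rest)
    return stage1, stage2


# Hypothetical rack layout, for illustration only.
racks = {
    "A": ["mw-a1", "mw-a2", "mw-a3"],
    "B": ["mw-b1", "mw-b2"],
    "C": ["mw-c1"],
}
stage1, stage2 = fanout_plan(racks)
origin_load = len(stage1)  # number of hosts that hit the origin directly
```

In this layout the origin serves one fetch per rack instead of one per host; everything else is served inside the rack.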

awight removed a subscriber: awight. Mar 21 2019, 4:02 PM

@akosiaris Should we still pursue this?

Harej triaged this task as Low priority. Apr 9 2019, 9:34 PM

> @akosiaris Should we still pursue this?

With the move to a different deployment tool in the pipeline, it's gonna be work that eventually gets tossed down the drain. So I'd say let's resolve this for now as declined, and if for some reason we have to revisit it, we can reopen it.