
New ganeti VM for MW release pipeline work
Closed, Resolved · Public

Description

In the current quarter, RelEng and Security are pairing up on improving our release pipeline. We plan to set up a standalone Jenkins installation--no Zuul, not connected to the rest of CI or Gerrit--to run daily jobs that generate our tar files & patches.

The existing Jenkins setup is not suitable, since these builds will contain unreleased security patches and access will be locked down. So we need a dedicated host, but there's no need to waste physical hardware on this.

I'm thinking 2 cores, 4 GB of RAM, and ~125 GB of disk should be plenty. We probably don't even need quite that much disk. The standard partition scheme (large /srv, everything else on /) is good.
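For reference, spinning up an instance with roughly these specs on a Ganeti cluster might look something like the sketch below; the disk template, allocator, network link, OS variant, and hostname are illustrative placeholders rather than decided values:

```
# Rough sketch only, run on the Ganeti master: the network link, OS variant and
# hostname below are placeholders, not confirmed values for this cluster.
# 2 vCPUs, 4 GB RAM, ~125 GB disk, DRBD-replicated, nodes picked by the hail allocator.
sudo gnt-instance add \
  -t drbd \
  -I hail \
  -B vcpus=2,memory=4g \
  --disk 0:size=125g \
  -o debootstrap+default \
  --net 0:link=private \
  mw-builder1001.eqiad.wmnet
```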

Event Timeline

Restricted Application added a subscriber: Aklapper.

Should this exist in both DCs? One in eqiad, one in codfw, per the default nowadays?

> Should this exist in both DCs? One in eqiad, one in codfw, per the default nowadays?

I suppose we can... but it's not really something that has to be available if a DC is down. It's the sort of service that can just wait patiently for the DC to be restored; nobody will miss it for a week or so.

> Should this exist in both DCs? One in eqiad, one in codfw, per the default nowadays?

> I suppose we can... but it's not really something that has to be available if a DC is down. It's the sort of service that can just wait patiently for the DC to be restored; nobody will miss it for a week or so.

(Also, the data generated will be local to that machine, so we would have an incomplete view of nightlies if we did fail over.)

In doing some research on the eqiad Ganeti cluster, it seems that while most hosts use 1 vCPU, a few have 2 or more, so it's not unheard of.

I'd suggest we use the hostname jenkins-jobrunner[12]001, since its job is running patch and tar builds?

Daniel's question on location is still not wholly settled. If a DC is offline, or we fail over from one to the other, should this service live where the master Jenkins server lives, or is it wholly decoupled from the rest of the Jenkins service?

Basically we tend to mirror Ganeti requests between codfw and eqiad, since either site should be able to do the full job of the other. If this is something that, were we to lose one site for a period of days, would simply remain happily offline, then I suppose it doesn't need to be in both. However, if it needs to live where the Jenkins master lives, then it should likely be set up in both for standby/fallback (like gerrit/jenkins is, correct?)

Once we know where to put it, I'll spin it up.

> I'd suggest we use the hostname jenkins-jobrunner[12]001, since its job is running patch and tar builds?

Yay, naming! How about jenkins-mw-builder[12]001?

> Daniel's question on location is still not wholly settled. If a DC is offline, or we fail over from one to the other, should this service live where the master Jenkins server lives, or is it wholly decoupled from the rest of the Jenkins service?

It's wholly decoupled from the rest of Jenkins. Neither of them will know about each other. It does need access to Gerrit for basic cloning/fetching operations, but that's it (Gerrit redundancy is handled in T152525, which I need to wrap up)

> Basically we tend to mirror Ganeti requests between codfw and eqiad, since either site should be able to do the full job of the other. If this is something that, were we to lose one site for a period of days, would simply remain happily offline, then I suppose it doesn't need to be in both. However, if it needs to live where the Jenkins master lives, then it should likely be set up in both for standby/fallback (like gerrit/jenkins is, correct?)

We could lose it for a week and almost nobody would care. It's basically an internal service for publishing nightlies that we can (on short notice) spin off as releases. If we had completely zero access to a whole DC for a week, I can think of a dozen other services we'd be sad about before I'd even think about this.

While it needs Gerrit for basic cloning/fetching, it won't care whether it is in the same datacenter, correct? If it is fetching a lot of data, though, it would be better to keep it near where it fetches that data from, to reduce traffic over cross-DC links.

So we should set it up where Gerrit tends to live as primary (likely eqiad, correct?), knowing that this isn't strictly needed and it's only in the same site to reduce cross-DC traffic.

> While it needs Gerrit for basic cloning/fetching, it won't care whether it is in the same datacenter, correct? If it is fetching a lot of data, though, it would be better to keep it near where it fetches that data from, to reduce traffic over cross-DC links.

Nope, it doesn't matter, as long as it's reachable. Most data is transferred on initial branching (stable releases, not weekly wmf branches) when we do a full clone. Otherwise it is just fetching the deltas.
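For illustration, that traffic pattern is just the difference between a one-time full clone and the incremental fetches each later build does; the repository below is only an example of the kind of clone involved:

```
# One-time full clone when the workspace is first set up (the big transfer)
git clone https://gerrit.wikimedia.org/r/mediawiki/core
cd core

# Every later build only fetches the deltas since the previous run
git fetch origin
```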

> So we should set it up where Gerrit tends to live as primary (likely eqiad, correct?), knowing that this isn't strictly needed and it's only in the same site to reduce cross-DC traffic.

My thought was eqiad to speed up the process. There's nothing private going cross-DC, and even if there were, it'd be over HTTPS. But yes, closest to Gerrit is best.

> I'd suggest we use the hostname jenkins-jobrunner[12]001, since its job is running patch and tar builds?

> Yay, naming! How about jenkins-mw-builder[12]001?

Drive-by naming comment:
+1, I can see jobrunner being confused with MW's jobrunner.

We have CI hosts like contint1001 / contint2001. What about a generic name like contint1002.eqiad.wmnet?

For network traffic, there is an initial spike of downloads for the git repositories; once they are populated, and assuming we keep them between builds, only deltas will be transferred. Regardless, even cloning everything is only going to be a few GB of data. The network traffic is low and there are only a few network round trips.

> We have CI hosts like contint1001 / contint2001. What about a generic name like contint1002.eqiad.wmnet?

I'm not a fan of this name, because it would likely be confused with the rest of CI. It also prevents us from making generic site.pp entries with a regex for all contint* machines.

I do note your comment, @fgiunchedi; I'm definitely open to suggestions.

> For network traffic, there is an initial spike of downloads for the git repositories; once they are populated, and assuming we keep them between builds, only deltas will be transferred. Regardless, even cloning everything is only going to be a few GB of data. The network traffic is low and there are only a few network round trips.

It's not really going to be that much data. We're cloning far fewer repos than we do in CI. It's just core + bundled extensions we care about here.

I'd be fine with something like mwreleases1001!

> We have CI hosts like contint1001 / contint2001. What about a generic name like contint1002.eqiad.wmnet?

This system is decoupled from the rest of the CI systems, and a VM to boot. It would be nice if the hostname described what it does a bit more.

> I'd be fine with something like mwreleases1001!

I like this better, since it will be the only place we're generating MW releases on the cluster, and it's quite descriptive of the service. Does anyone hate this hostname?