
Target architecture without gallium.wikimedia.org
Closed, Resolved · Public

Description

gallium.wikimedia.org has to be phased out. We need a document covering the target architecture that will serve as the foundation for the migration and for completing the host decommissioning.

Steps:

  • Diagrams of all components
  • Team review / approval
  • netops approval


Event Timeline

Was poking Brandon about letsencrypt and asked a few questions related to misc cache routing. It does not currently support routing different URL paths to different backends, but configuring a different backend per hostname is straightforward.

[22:14:31]  <hashar>	another related question is does our misc cache supports routing URL paths to different backends ? 
[22:14:55]  <hashar> like  https://integration.wikimedia.org/jenkins01  https://integration.wikimedia.org/jenkins02   https://integration.wikimedia.org/somethingelse
[22:15:03]  <hashar>	with each three served on different hosts?
[22:15:46]  <bblack>	hashar: it's possible, but we don't have any cases like that today as examples
[22:16:04]  <bblack>	there's some minor VCL infastructure to plumb for it, to template that out in generic terms so we can manage the config as data.

Relevant configuration file and logic handling:

[22:19:02]  <bblack>	the two together mean we can define new cache_misc services in the data in the manifest, and give them an attribute like req_host => integration.wikimedia.org
[22:19:12]  <bblack>	we just need to extend that to include path regexes as well

[22:21:02]  <hashar>	the $app_directors hash is straight forward and nicely abstract all the low level logic which is nice
[22:21:20]  <bblack>	yeah I'm hoping to continue expanding on that to remove a lot of one-off custom VCL
[22:21:42]  <bblack>	the ticket for the long-term view is at https://phabricator.wikimedia.org/T110717
[22:21:55]  <bblack>	right now only cache_misc has that at all, and it's a very minimal version of it so far
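To make that concrete, here is a rough Puppet sketch of what such a data-driven entry could look like once path regexes are supported. The key names (backend, req_host, req_path) and the hostnames are purely illustrative assumptions, not the actual cache_misc data structure in operations/puppet.

# Illustrative sketch only -- keys and hosts are hypothetical, not the real
# cache_misc data. The idea is that each service is declared as data and the
# VCL is templated from it.
$app_directors = {
  'jenkins' => {
    'backend'  => 'contint1001.wikimedia.org',
    'req_host' => 'integration.wikimedia.org',
    'req_path' => '^/ci(/|$)',    # hypothetical path-regex extension
  },
  'zuul'    => {
    'backend'  => 'scandium.eqiad.wmnet',
    'req_host' => 'integration.wikimedia.org',
    'req_path' => '^/zuul(/|$)',
  },
}

With something along those lines, https://integration.wikimedia.org/ci and https://integration.wikimedia.org/zuul could be routed to different backend hosts while sharing the misc cache TLS termination.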

Question: Why are some of the boxes above blue and some yellow (scandium and labnodepool1001, for example)?

Been busy with the wmf branches, bank holidays and migrating PHP jobs to Nodepool instances. Should be able to switch to this soonish though.

> Question: Why are some of the boxes above blue and some yellow (scandium and labnodepool1001, for example)?

The yellow ones are the existing boxes, already in prod and working.

The blue ones, XXX, YYY and ZZZ, represent groups of services that currently all run on gallium but could be split into three distinct units. We would want to discuss where to put them though, either:

  • regroup some further to use fewer machines
  • reshuffle/rethink everything onto new hosts / existing hosts
  • use a machine or VM per service group

Additionally the blue boxes are reachable from the public internet via the misc varnish.

I would like to add a few more diagrams, one per service, to better show how each service is tied to the others. The big-picture overview is not necessarily easy to understand.

I have made a few more drawings in Google Drawings, added them to a folder and shared it with subscribers of this task (hopefully).

Graphs are:

Overview:

CI Target 2016 - Overview.png (81 KB)

Apache proxy routing on Gallium to redirect to Jenkins and Zuul:

CI Apache routes - v20160421.1.png (368×684 px, 33 KB)

Details about the Zuul/Gearman interaction, highlighting that Gearman is pretty central. The Gearman server is actually embedded inside Zuul but could be split out into a standalone gearman daemon.

CI Target 2016 - details of Zuul_Gearman.png (420×815 px, 40 KB)

The various flows for Jenkins (there are a few more):

CI Target 2016 - details of Jenkins.png (498×717 px, 38 KB)

From the overview graph, the three blue boxes XXX, YYY and ZZZ are a way to split the services currently all hosted on gallium. We would need help from ops to find out what kind of hosts we want to migrate to (metal, Ganeti, labs, other), and in which realm/network.

For doc.wikimedia.org / integration.wikimedia.org, that is typical 90's web hosting. Some documentation expects PHP to be available and thus runs arbitrary code merged into the branches. We would probably want to move that to a host that is isolated from the rest of the infrastructure.

The Zuul scheduler / Gearman services could be moved to hosts on the labs networks such as labnodepool1001 and scandium. One concern is that Zuul exposes a web service which would thus need to be proxied via misc-varnish. I am not sure whether misc-varnish can use a backend on a labs-network host. Maybe we can have some kind of DMZ / dedicated network for that.

Gearman is quite central and can be split onto a different host from the Zuul scheduler. Gearman would need to be reachable by Jenkins, Nodepool, the Zuul mergers and the Zuul scheduler.
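As a sketch of what that split means in practice, here is the relevant shape of a Zuul v2 zuul.conf managed from Puppet: the [gearman] section is what the scheduler, mergers and other clients connect to, while [gearman_server] controls the embedded geard. The hostname and the bare file resource below are assumptions for illustration, not the actual role::zuul puppetization.

# Sketch only: hostname and resource layout are illustrative. Splitting
# Gearman out means pointing [gearman] at the standalone host and keeping
# start=false on every host that should not run the embedded geard.
file { '/etc/zuul/zuul.conf':
  ensure  => file,
  content => @(ZUUL),
    [gearman]
    server=gearman.ci.eqiad.wmnet
    port=4730

    [gearman_server]
    start=false
    | ZUUL
}

Jenkins' gearman-plugin, Nodepool and the Zuul mergers would be configured with the same host and port (4730 is the Gearman default).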

Jenkins has a lot of disk I/O and a labs instance would definitely not work. It holds a lot of sensitive information such as private SSH keys and passwords (none for prod though).

We would want to discuss how to best split and host those services. Ganeti instances in eqiad sound like good candidates if they can be in a network zone that has no access to the rest of prod while still being reachable from labs.

C) we migrate the whole CI infra to labs

Talked about this at length with @hashar and @thcipriani and I think we're all in agreement that this is the best course of action. Couple of reasons:

  1. The slaves are already in labs
  2. There's basically no reason the master can't be in labs (see plan below)
  3. Jenkins configuration changes frequently and by multiple people, production instances are less ideal for this

Moving forward, let's do this:

  1. Clean up existing puppet config, remove anything that's unused and clarify any abstractions
  2. Puppetize Jenkins config
    1. Easier said than done -- start by just stuffing what we have into puppet as is
    2. Then work on trying to make it more manageable
  3. Create xlarge instance in labs to act as new jenkins master (could be masterS in the future)
    1. Double the RAM of gallium, lol. We can jack up the JVM heap (see the sketch after this list) and hopefully better tune Jenkins to make use of the extra RAM
    2. CPU will be fine, master's not super cpu-bound anyway
    3. Hopefully a better-tuned Jenkins can avoid a ton of disk IO, since that's really the slowest part of a labs instance. Not that gallium's disks were great.
  4. Begin using Zuul scheduler in labs to talk to Gerrit instead of the gallium-based Zuul. Zuul should be mostly (completely?) puppetized already and can exist on a separate host to keep it off the new Jenkins master.
  5. doc.wikimedia.org and integration.wikimedia.org reports will be published via git repositories. We'd need a new puppet setup for a dumb apache that will fetch these reports. It would be tiny and could easily live as a ganeti VM. Too many incoming links to doc.wm.o at the very least to move it all to *.wmflabs.org I think :\
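For item 3.1, a minimal sketch of what bumping the JVM heap could look like in puppet, assuming the Debian jenkins package (which reads JAVA_ARGS from /etc/default/jenkins) and the stdlib file_line resource; the -Xmx value is just an example, not a tuned recommendation:

# Sketch only: assumes the Debian 'jenkins' package and puppetlabs-stdlib.
service { 'jenkins':
  ensure => running,
}

file_line { 'jenkins_java_args':
  path   => '/etc/default/jenkins',
  line   => 'JAVA_ARGS="-Djava.awt.headless=true -Xmx8g"',
  match  => '^JAVA_ARGS=',
  notify => Service['jenkins'],
}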

Sound reasonable?

Did a quick architecture document in a Google Doc, though it is only shared with a few people. Release-Engineering-Team is going to refine it and then present it to, I guess, netops.

Is it correct that zuul can not be clustered? I.e. there can only be one and there is no failover/handover in its architecture?

The Zuul scheduler is indeed a SPOF, and so is Nodepool.

Yes, but it's also a standalone service that can be isolated from Jenkins on its own node. This should help.

Isolating it from Jenkins in some form is a good idea. If Zuul and Nodepool are moved to labs, would that make the whole of CI more unreliable?

I can confirm that the zuul puppet roles do indeed work and can be independent of the jenkins installation. A few resources might be missing though if they're installed alone.

> C) we migrate the whole CI infra to labs
>
> (full proposal quoted in the comment above)

@hashar, @demon, and I had another conversation in IRC late on Friday and decided to move in a different direction than putting everything on labs nodes.

There are several drawbacks to running all on labs:

  • SSL and wikimedia.org domains would both have to change
  • Secrets stored on the master (passwords, etc.) would no longer be all that secret in a labs project
  • Disk I/O for a labs instance might be a problem. It's at least an unknown.
  • Moving to labs requires quite a bit of refactoring of current, working puppet code

In the interest of expediency and because of the drawbacks mentioned, the current plan is to move all services to scandium in the short term. In the medium term, it would be nice to break the services running on scandium out onto separate nodes so that scandium is less of a SPOF.

I also think that moving CI entirely into Labs is not a very good idea.

CI is a critical service, in the sense that the fallout of it being down is large and thus it requires production-level uptime and incident response. There is a reason we call production, erm "production" :)

Labs availability is getting better but still lags behind vis-a-vis production. Labs is also a virtualized environment with whatever limitations that entails (e.g. IOPS) and is also a public-facing, shared environment which also comes with other limitations (e.g. security). These are essentially some of the concerns that @thcipriani is raising — and is right about.

Moreover, from a social perspective, we're in this mess precisely because corners were cut when this service was set up (unpuppetized etc.). I'm not convinced that /fewer/ eyes on this and less oversight from TechOps will result in a cleaner/easier to maintain service. It might sound easier to set up (no more waiting for +2s!) but frankly, I think there is a considerable risk this may result in a big bowl of an unpuppetized mess, even more unreliable and with even fewer people capable of troubleshooting it when things go south (e.g. the opsens that helped last week).

Finally, and orthogonally to the above, I'd like to also echo everyone else who called for a more reliable architecture with fewer SPOFs. That said, spreading out the various SPOF services across multiple machines wouldn't actually /increase/ the reliability, but rather decrease it :)

Yeah, we're all in agreement that labs isn't the best option, except for spinning up the on-demand slave nodes which are ephemeral. The coordination-type nodes (Jenkins master, Zuul, etc etc) should remain in production.

What do you think of the proposed solution of moving the CI support node(s) (Zuul, Jenkins, Gearman) into the same subnet as other labs-support boxes (e.g. strontium)? This would solve the problem of requiring a public IP -- which, afaict, isn't needed for any reason; the actual apache sites live behind misc_web. The only question we were unsure of was whether misc_web can/should proxy for a node that's in that labs-support subnet. My thinking is yes, but please let me know more :)

Regarding +2 and moving quickly on puppet: I agree. The best supported scenario is when things are well puppetized and in sync with the production branch. In fact, I believe most of CI (gearman, zuul, most of Jenkins) is indeed puppetized at this time -- although not well; cleanup is needed. The major caveat at the moment is the Jenkins configuration, which is not *easily* puppetized. It's a myriad of XML files, some of which change often -- this needs the most work.
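A minimal sketch of the "stuff what we have into puppet as is" first pass, under the assumption of a /var/lib/jenkins home and a hypothetical 'jenkins' module holding the exported XML; templating and proper abstraction would come later:

# Sketch only: ship the current global config verbatim from the module so it
# is at least reproducible and reviewed. Changes made through the web UI
# would be reverted by puppet, which is the point of funneling them through
# review. Paths and module name are assumptions.
file { '/var/lib/jenkins/config.xml':
  ensure => file,
  owner  => 'jenkins',
  group  => 'jenkins',
  mode   => '0644',
  source => 'puppet:///modules/jenkins/config.xml',
}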

Cleaning up the existing puppet and allowing services to be isolated will let us build a more robust CI that can properly serve our needs, I believe.

From the doc https://docs.google.com/document/d/1FR6IOP_4rxHRhLesYuokB-AhUzKBMGE4PEoOz2Wpk3A/edit (privately shared with subscribers of this task) there are a few open questions:

  • Would Jenkins/Zuul web services on labs host network be able to be exposed via misc cache?

They are served from https://integration.wikimedia.org/ci/ and https://integration.wikimedia.org/zuul/ and thus benefit from misc cache TLS and the *.wikimedia.org certificate. Users authenticate with Jenkins using their labs LDAP credentials.

If we do that, that would mean internet requests ultimately being served from the labs support host network, which might be an issue. My understanding is that it is a private network intended solely to support labs projects.

  • scandium would need more disk space to host Jenkins artifacts and build logs.

We have a couple of 150 GB SSDs in RAID, with /srv/ssd having 128 GB available right now. We would need at least 250 GB, ideally 500 GB. Can we add 500 GB hard drives to scandium? Ideally the Jenkins build metadata would be on SSD with the artifacts on hard disks; I am not sure whether we can split them though.

integration.wikimedia.org is a clone of the git repo integration/docroot.git and is a fairly simple web portal with PHP. It might be handled on scandium as well.
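For illustration, a rough sketch of what serving that checkout as a "dumb" web host could look like, wherever it ends up. The class name, paths, vhost and anonymous clone URL are assumptions (the PHP bits and the real operations/puppet apache tooling are left out), and puppetlabs-vcsrepo is assumed for the clone:

# Sketch only: a small apache serving a clone of integration/docroot.git.
class contint::website {
  package { 'apache2':
    ensure => present,
  }

  # Requires the puppetlabs-vcsrepo module; the clone URL is illustrative.
  vcsrepo { '/srv/docroot':
    ensure   => latest,
    provider => git,
    source   => 'https://gerrit.wikimedia.org/r/integration/docroot',
  }

  file { '/etc/apache2/sites-enabled/integration.conf':
    ensure  => file,
    content => "<VirtualHost *:80>\n  ServerName integration.wikimedia.org\n  DocumentRoot /srv/docroot\n</VirtualHost>\n",
    require => Package['apache2'],
    notify  => Service['apache2'],
  }

  service { 'apache2':
    ensure  => running,
    require => Package['apache2'],
  }
}

It would be applied with a plain include contint::website on whichever host is chosen.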

Publishing to doc.wikimedia.org has been split off to T13790. It raises similar questions as to whether we allow public traffic to be terminated in the labs support host network and PHP to be executed on machines there. Given they are docs / code from developers, we probably want them hosted on a different machine / network.

Wanted to poke ops / @faidon / whoever knows :) about these open questions regarding the current plan to use scandium as a gallium replacement in the near term:

  • Would Jenkins/Zuul web services on labs host network be able to be exposed via misc cache?
  • scandium would need more disk space to host Jenkins artifacts and build logs.
  • Where does https://integration.wikimedia.org/ live?

To restate, I think we mostly have open questions about the network layout if we move all services to scandium:

  • Would varnish be able to talk to scandium since it's in the labs host network?
  • Can we add 500G disk-space to scandium? (unsure of process here if possible)
  • Possible to repoint integration.wikimedia.org? Any problems with moving that to the labs host network?

@chasemp made some comments that address the varnish/scandium question in IRC. Copy-pasted below:

<thcipriani>  we actually have some open questions about the new CI architecture: https://phabricator.wikimedia.org/T133300
<thcipriani>  not clear if we're going to be using contint1001 since it can't reach labs instances from its network(?)
<godog>       afaik it shouldn't via the internal labs addresses no
<godog>       don't quote me on that though :)
<akosiaris>   I agree, labs private IPs should be not be reachable from production private IPs and vice versa
<chasemp>     there are servers within prod private ip space accessible from private ip space in labs, primarily labs-support vlan things and ldap servers
<thcipriani>  yup. so we're thinking of moving to scandium which is in the labs host network, but since we do rely on the varnish misc cache: we're not sure if we can move into that network.
<thcipriani>  I think that's our main open question: whether we can still be behind the varnish cache if we move to scandium.
<chasemp>     thcipriani: you want scandium on a private vlan, acccessible by labs VM's, able to access labs VM's (22 only?), and able to be behind varnish for gerrit/jenkins?
<thcipriani>  chasemp: that is mostly my understanding. Most of my understanding comes from hasharAway so small bits of that may not be true, but I think the broad strokes are correct.
<chasemp>     thcipriani: why are we putting scandium is labs-hosts1-b-eqiad?
<chasemp>     in even
<thcipriani>  I don't understand the question
<akosiaris>  it's in labs-support1 btw
<chasemp>    labs-hosts is mostly for openstack inf, nodepool was included there for that reason and it has the labvirts and is also the transit network for actual labs vm's
<chasemp>    labs-support is generally services we consider production that provide functionality to labs vm's
<chasemp>    i.e. nfs, etc
<chasemp>    so labs-support seems ideologically the right place and there isn't a reason it couldn't be behind varnish, other than it may not be setup afaik
<chasemp>    I don't get why promethium is in labs-hosts
<chasemp>    but it's not a good misc services or misc things vlan

My take-away is that Scandium is in labs-support, which is good, and that there's no reason it couldn't be behind a varnish server.

> My take-away is that Scandium is in labs-support, which is good, and that there's no reason it couldn't be behind a varnish server.

Yes that's fine. scandium can be behind misc cache as far as the networking side of things goes.

<chasemp>    I don't get why promethium is in labs-hosts
<chasemp>    but it's not a good misc services or misc things vlan

This is most likely a bit of a sidetrack, but as promethium.wikitextexp.eqiad.wmflabs is a baremetal machine in labs, isn't it supposed to be in that vlan? Or is there another more appropriate one?

> <chasemp>    I don't get why promethium is in labs-hosts
> <chasemp>    but it's not a good misc services or misc things vlan
>
> This is most likely a bit of a sidetrack, but as promethium.wikitextexp.eqiad.wmflabs is a baremetal machine in labs, isn't it supposed to be in that vlan? Or is there another more appropriate one?

I'm pretty sure promethium is in labs-instances1-b-eqiad, not labs-hosts:

alex@alex-laptop:~/Development/Wikimedia/Operations-DNS (master)$ grep "\; 10" templates/10.in-addr.arpa | grep 10.68.16
; 10.68.16.0/21 - labs-instances1-b-eqiad

which covers 10.68.16.0 to 10.68.23.255 - promethium is 10.68.16.2
The only labs-hosts network in eqiad is labs-hosts1-b-eqiad which is 10.64.20.0/24

Edit: Okay, here's the source of the confusion as pointed out by @Dzahn:

krenair@bastion-01:~$ host promethium
promethium.eqiad.wmflabs has address 10.68.16.2
Host promethium.eqiad.wmflabs not found: 3(NXDOMAIN)
Host promethium.eqiad.wmflabs not found: 3(NXDOMAIN)
krenair@bastion-01:~$ host promethium.eqiad.wmnet
promethium.eqiad.wmnet has address 10.64.20.12

Yes, I'm not sure why there is an allocation in labs-hosts-b for

templates/wmnet:promethium 1H IN A 10.64.20.12

but this is the incorrect thing I was talking about above. Side note: how did promethium become a bare metal allocation in the instance subnet, and when? We agreed as a team not to do that, as it's full of hacks, grey areas of responsibility, and generally not the solution to 99% of issues. That is tangential to this task though.

@chasemp, @thcipriani and @hashar had a one-hour discussion covering the context and the various requirements. Well covered; minutes might be published.

The wiki page summarizing the task seems to have been deleted with a comment of "not needed anymore". @hashar above mentioned a Google doc, which I have access to, but it seems to present a few different options that seem obsolete as well — it ignores comments here and elsewhere and concludes with a plan which I'm fairly sure the team has decided not to go with. The tasks are also a bit all over the place and I'm honestly getting overwhelmed :)

Is there a more recent (re-)architecture proposal for CI somewhere? Has an architecture been considered or proposed that isn't full of SPOFs?

Replacing a 5-year old server (gallium) with another one (scandium) makes no sense. Replacing it with a new server (contint1001) is better, but still not great — this server could still fail at any point in time. Having a properly Puppetized setup would help us in lowering our time to recovery by allowing us to reprovision a server quickly, but we'll still have an hours-long outage and we'll have to treat this as an emergency.

Is there perhaps a more medium-term plan that is preparing us for a more scalable/distributed/HA architecture? If not, can we open *that* discussion too?

> Is there perhaps a more medium-term plan that is preparing us for a more scalable/distributed/HA architecture? If not, can we open *that* discussion too?

+1

> The wiki page summarizing the task seems to have been deleted with a comment of "not needed anymore". @hashar above mentioned a Google doc, which I have access to, but it seems to present a few different options that seem obsolete as well — it ignores comments here and elsewhere and concludes with a plan which I'm fairly sure the team has decided not to go with. The tasks are also a bit all over the place and I'm honestly getting overwhelmed :)
>
> Is there a more recent (re-)architecture proposal for CI somewhere?

There were definitely too many tasks and random documents. Most tasks have been closed now and I have cleaned up the Google Drive folder. I have edited this task description with a table listing all the documents, copied below as a convenience:

The overview of the envisioned target architecture is in CI Target 2016 - architecture (Google drawing).

I have created a few more drawings limited to specific parts:


! In T133300#2491756, @faidon wrote:

Replacing a 5-year old server (gallium) with another one (scandium) makes no sense. Replacing it with a new server (contint1001) is better, but still not great — this server could still fail at any point in time.

I thought about reusing scandium.eqiad.wmnet since it is already in the envisioned network area; I had no clue it was an old server, and I had the feeling that using contint1001 would be overpowered for the task at hand and thus a waste of hardware.

That got clarified with Tyler / Chase a couple of weeks ago and I have formally asked for contint1001 to be assigned and moved to the labs support network (T140257). The additional horsepower would give us room to add more services to it.

I have updated the architecture drawing to replace scandium with contint1001.


! In T133300#2491756, @faidon wrote:

Having a properly Puppetized setup would help us in lowering our time to recovery by allowing us to reprovision a server quickly, but we'll still have an hours-long outage and we'll have to treat this as an emergency.

With the exception of the Jenkins global configuration, everything else on gallium is deployed using Debian packages and is fully in Puppet. A replica has been made in labs by volunteer @Paladox with just a few hiera tweaks; we used it to validate Gerrit 2.12 and CI.

Puppetizing Jenkins global configuration is definitely in the backlog and will be tracked via T69027. It is pretty much going to be a blocker as:

  • we want CI to have multiple Jenkins masters so we can easily do hot maintenance
  • jobs that run on a daily basis would be better off on a standalone master, to make them more manageable and not mixed with the rest of the per-patchset jobs
  • we have a few private use cases floating around:
    • mobile want to use Jenkins to release the Android application (T136662, T104207)
    • we have to automate MediaWiki third-party releases and Jenkins is a candidate
    • we might want to drive scap deployments with a Jenkins harness (just an idea for now).

The June 2016 outage has shown that the Jenkins global config definitely has to be reproducible, and we could even have changes funneled via review/puppet (currently changes are made directly via the web interface).

All the rest is about applying a few role::ci:: puppet classes.
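For illustration, that is on the order of the following site.pp stanza; the class names are examples based on the existing role::ci classes and may not match their exact current names:

# Illustrative only: class names are examples, not necessarily the exact
# classes in operations/puppet.
node 'contint1001.wikimedia.org' {
  include role::ci::master     # Jenkins master and the Zuul server bits
  include role::ci::website    # integration.wikimedia.org / doc.wikimedia.org docroot
}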

! In T133300#2491756, @faidon wrote:
Has an architecture been considered or proposed that isn't full of SPOFs?

Is there perhaps a more medium-term plan that is preparing us for a more scalable/distributed/HA architecture? If not, can we open *that* discussion too?

Yes, definitely. I would like to work on that once we have gotten rid of the 5+ year old server and have the infrastructure running on Jessie.

For the SPOFs that is roughly:

Jenkins

We would need at least one other Jenkins master, which depends on having the setup in Puppet (T69027).

Nodepool

It maintains the pool of instances. Over the past year I do not remember it having been the direct cause of any incident.

Zuul

We had multiple issues which were mostly due to outdated code. I have almost finished catching up with upstream code, which will make it more robust. The Zuul scheduler is not designed as a distributed system and is definitely a SPOF, but we could have a hot spare to mitigate that.

Let's focus on finding a new home for the CI services and phasing out gallium/Precise; then we can look at the SPOFs.

wmflabs

Whenever the wmflabs infra / OpenStack API has an issue, most jobs are unable to run, either because the labs instances have trouble or because we can't delete/spawn new ones. Nodepool can support multiple instance providers; we would want to work with the Cloud-Services team on that eventually.

> Having a properly Puppetized setup would help us in lowering our time to recovery by allowing us to reprovision a server quickly, but we'll still have an hours-long outage and we'll have to treat this as an emergency.
>
> With the exception of the Jenkins global configuration, everything else on gallium is deployed using Debian packages and is fully in Puppet. A replica has been made in labs by volunteer @Paladox with just a few hiera tweaks; we used it to validate Gerrit 2.12 and CI.

Eh, sorry, but the things paladox setup in labs were all manual and not from puppet roles. I talked to him about just that and how it would be better if we could actually use the roles.

What @hashar says is true in that I got CI running in labs, but I didn't use all the puppet classes gallium uses.

I used bits and pieces of puppet. When I did it we were still using Gerrit 2.8, so I skipped the gerrit puppet role (except for the gerrit proxy role) and I also used the zuul puppet role. For Jenkins, as it is not puppetised, I used a dpkg from the official Jenkins website; we should really upgrade to Jenkins 2.*, it works really well.

Problems that need solving

  • We need to puppetise Jenkins.
  • We need to make it easy to find the correct puppet roles.
  • We need to decide on puppet role names and stick with them to prevent problems in the future.

> Eh, sorry, but the things paladox setup in labs were all manual and not from puppet roles. I talked to him about just that and how it would be better if we could actually use the roles.

Release-Engineering-Team (e.g. @hashar, @demon, @mmodell): I could set up a test instance which is fully puppetised and is a copy of gallium, but I would need help to do that since I am new to puppet and not really sure how to do it and make an actual copy of gallium.

@Dzahn ^^

That took a while, but as an outcome of T140257 the new machine contint1001 is going to use the same architecture as gallium, namely it is in the public production network with a public IP address.

Thus I am claiming this task resolved; the architecture stays the same for now. If we later decide to rearchitect, we can reuse some of the material here, but that will probably be a new task / epic.