
Expose blubberoid to the public allowing CI in WMCS to be able to reach out as well to it
Closed, ResolvedPublic

Description

In order to make use of the production deployment of Blubberoid in CI, integration labs hosts will need access to blubberoid.discovery.wmnet:8748.

dduvall@integration-slave-docker-1044:~$ nc -zv blubberoid.discovery.wmnet -w 3 8748
DNS fwd/rev mismatch: blubberoid.discovery.wmnet != blubberoid.svc.eqiad.wmnet
blubberoid.discovery.wmnet [10.2.2.31] 8748 (?) : Connection timed out
dduvall@integration-slave-docker-1044:~$ ping blubberoid.discovery.wmnet
PING blubberoid.discovery.wmnet (10.2.2.31) 56(84) bytes of data.
From ae2-1120.cr2-eqiad.wikimedia.org (10.64.22.3) icmp_seq=1 Packet filtered
From ae2-1120.cr2-eqiad.wikimedia.org (10.64.22.3) icmp_seq=2 Packet filtered
From ae2-1120.cr2-eqiad.wikimedia.org (10.64.22.3) icmp_seq=3 Packet filtered
^C
--- blubberoid.discovery.wmnet ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2003ms

RelEng also discussed in its weekly Release Pipeline meeting that we'd like developers to have access to a running blubberoid. If it's acceptable to SRE to open up public traffic to blubberoid, let's do that here as well. If not, we will discuss other options.

Event Timeline

dduvall updated the task description. Dec 18 2018, 8:48 PM
hashar added a subscriber: hashar. Dec 19 2018, 10:43 AM

The wmnet. top-level domain is for production / internal services. WMCS is considered an alien and is not allowed to reach it. The service would need a public IP (on LVS?) and a DNS record under wikimedia.org.

A similar case is the Docker registry:

Public:  docker-registry.wikimedia.org    IN A      91.198.174.192 (text-lb-esams.wikimedia.org)
Private: docker-registry.discovery.wmnet  IN CNAME  darmstadtium.eqiad.wmnet.

We could have used wikimedia.org for both external and internal use; that is doable using DNS split horizon, a feature that lets a DNS server give different responses based on the client's origin. But we went with different domains :)
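
To illustrate the idea, here is a minimal split-horizon sketch in BIND view syntax; this is a generic example with hypothetical zone file names, not how Wikimedia's authoritative DNS is actually configured:

  view "internal" {
      match-clients { 10.0.0.0/8; };      // production private network
      zone "wikimedia.org" {
          type master;
          file "wikimedia.org.internal";  // answers with internal addresses
      };
  };

  view "external" {
      match-clients { any; };             // everyone else, including WMCS and the internet
      zone "wikimedia.org" {
          type master;
          file "wikimedia.org.external";  // answers with public addresses
      };
  };

The same name can then resolve to an internal service address for production hosts and to a public one (or not at all) for everyone else.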

I guess it depends on the use cases.

As SREs, I don't think we would want to expose a public service that is not going to receive any traffic any time soon, mostly for operational and complexity reasons. I guess we don't really have that need yet, do we?

On the other hand, I can clearly see the need for this for CI, where it'll be really nice to have. We also have a related issue of figuring out how to run the rest of the Kubernetes services in a way that helps people using deployment-prep; we should follow the same approach for that in order to avoid duplicating work.

> I guess it depends on the use cases.
>
> As SREs, I don't think we would want to expose a public service that is not going to receive any traffic any time soon, mostly for operational and complexity reasons. I guess we don't really have that need yet, do we?

But isn't that exactly the point of this task? Open the service to the outside world so we can start using it. Instances on WMCS cannot reach the production internal network (10.0.0.0/8); they get out via PAT/NAT and are considered just like any other internet traffic (== untrusted).

> On the other hand, I can clearly see the need for this for CI, where it'll be really nice to have. We also have a related issue of figuring out how to run the rest of the Kubernetes services in a way that helps people using deployment-prep; we should follow the same approach for that in order to avoid duplicating work.

I can't say for Kubernetes since I don't know where it lands in the network, but I assume it is in the production private network as well (10.0.0.0/8) and thus cannot be reached.

For CI, we originally set up the Jenkins master with a public IP since that is what we did back then. It ends up being able to reach WMCS thanks to an extra route and some firewall rules between production and WMCS, but that is probably technical debt that will need to be dropped eventually.

For Blubberoid, the worst-case scenario is that we set up another instance in the WMCS support network. It would then be reachable by WMCS instances (but not by production).

> I guess it depends on the use cases.
>
> As SREs, I don't think we would want to expose a public service that is not going to receive any traffic any time soon, mostly for operational and complexity reasons. I guess we don't really have that need yet, do we?
>
> But isn't that exactly the point of this task? Open the service to the outside world so we can start using it. Instances on WMCS cannot reach the production internal network (10.0.0.0/8); they get out via PAT/NAT and are considered just like any other internet traffic (== untrusted).

Is it? Because I gather that it is specifically for CI, in which case an instance in CI is architecturally a better construct.

> On the other hand, I can clearly see the need for this for CI, where it'll be really nice to have. We also have a related issue of figuring out how to run the rest of the Kubernetes services in a way that helps people using deployment-prep; we should follow the same approach for that in order to avoid duplicating work.
>
> I can't say for Kubernetes since I don't know where it lands in the network, but I assume it is in the production private network as well (10.0.0.0/8) and thus cannot be reached.
>
> For CI, we originally set up the Jenkins master with a public IP since that is what we did back then. It ends up being able to reach WMCS thanks to an extra route and some firewall rules between production and WMCS, but that is probably technical debt that will need to be dropped eventually.

Yes. It is also very tangential to my question about the architecture of this. If CI moves entirely into or out of WMCS, all of this becomes moot: in the former case it is architecturally more correct to talk to resources in its WMCS project, and in the latter it can just use blubberoid.discovery.wmnet.

> For Blubberoid, the worst-case scenario is that we set up another instance in the WMCS support network. It would then be reachable by WMCS instances (but not by production).

I think you mean the CI project (or any project, for that matter). The WMCS support network is NFS hosts and labsdb/labmon hosts.

> As SREs, I don't think we would want to expose a public service that is not going to receive any traffic any time soon, mostly for operational and complexity reasons. I guess we don't really have that need yet, do we?
>
> But isn't that exactly the point of this task? Open the service to the outside world so we can start using it. Instances on WMCS cannot reach the production internal network (10.0.0.0/8); they get out via PAT/NAT and are considered just like any other internet traffic (== untrusted).
>
> Is it? Because I gather that it is specifically for CI, in which case an instance in CI is architecturally a better construct.

I may have been working under a certain set of assumptions when filing this task (that it was possible/OK to open up traffic from WMCS to this host in .wmnet), so let me try to back up and just clarify the use cases as I see them.

We have two use cases for blubberoid. One serves our deployment pipeline running in CI, allowing it to generate Dockerfiles for image builds. The second is for developers to make the same use of it, building images locally for testing, though the exact tooling for this is still only conceptual. Perhaps most importantly, there is an overarching requirement across these two use cases that image building be as consistent as possible in the different environments.

It's certainly possible for us to set up our own instance of blubberoid running in the integration project (or another project in WMCS). However, IMO running a single deployment of blubberoid would bring us much closer to our goal of unifying tooling across environments: the same versions of Blubber config would always be supported, the same Blubber policies enforced, the same Dockerfile output generated, etc.

> For CI, we originally set up the Jenkins master with a public IP since that is what we did back then. It ends up being able to reach WMCS thanks to an extra route and some firewall rules between production and WMCS, but that is probably technical debt that will need to be dropped eventually.
>
> Yes. It is also very tangential to my question about the architecture of this. If CI moves entirely into or out of WMCS, all of this becomes moot: in the former case it is architecturally more correct to talk to resources in its WMCS project, and in the latter it can just use blubberoid.discovery.wmnet.

Again, I don't think that having a specific instance of blubberoid in the integration project makes sense if the overarching goal is to achieve consistent image builds across environments. A single deployment makes more sense to me.
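
To make the single-deployment idea concrete, generating a Dockerfile from CI or from a developer's machine could look roughly like the sketch below. The POST /v1/<variant> endpoint reflects Blubberoid's API as I understand it; the .pipeline/blubber.yaml path and the test variant are hypothetical, and the hostname assumes the public DNS name created later in this task:

  # ask the shared blubberoid deployment to render a Dockerfile for one variant
  curl -s -X POST \
    -H 'Content-Type: application/yaml' \
    --data-binary @.pipeline/blubber.yaml \
    https://blubberoid.wikimedia.org/v1/test > Dockerfile

  # then build the image from the generated Dockerfile
  docker build -f Dockerfile .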

> As SREs, I don't think we would want to expose a public service that is not going to receive any traffic any time soon, mostly for operational and complexity reasons. I guess we don't really have that need yet, do we?
>
> But isn't that exactly the point of this task? Open the service to the outside world so we can start using it. Instances on WMCS cannot reach the production internal network (10.0.0.0/8); they get out via PAT/NAT and are considered just like any other internet traffic (== untrusted).
>
> Is it? Because I gather that it is specifically for CI, in which case an instance in CI is architecturally a better construct.
>
> I may have been working under a certain set of assumptions when filing this task (that it was possible/OK to open up traffic from WMCS to this host in .wmnet), so let me try to back up and just clarify the use cases as I see them.

Just for clarity's sake, WMCS machines talking directly to an internal production endpoint (.wmnet) would/should never happen. There is the exception of the WMCS supporting services, but those are just that: exceptions.

If anything, the endpoint would be publicly exposed and WMCS would use it the exact same way external users would.

> We have two use cases for blubberoid. One serves our deployment pipeline running in CI, allowing it to generate Dockerfiles for image builds. The second is for developers to make the same use of it, building images locally for testing, though the exact tooling for this is still only conceptual. Perhaps most importantly, there is an overarching requirement across these two use cases that image building be as consistent as possible in the different environments.
>
> It's certainly possible for us to set up our own instance of blubberoid running in the integration project (or another project in WMCS). However, IMO running a single deployment of blubberoid would bring us much closer to our goal of unifying tooling across environments: the same versions of Blubber config would always be supported, the same Blubber policies enforced, the same Dockerfile output generated, etc.
>
> For CI, we originally set up the Jenkins master with a public IP since that is what we did back then. It ends up being able to reach WMCS thanks to an extra route and some firewall rules between production and WMCS, but that is probably technical debt that will need to be dropped eventually.
>
> Yes. It is also very tangential to my question about the architecture of this. If CI moves entirely into or out of WMCS, all of this becomes moot: in the former case it is architecturally more correct to talk to resources in its WMCS project, and in the latter it can just use blubberoid.discovery.wmnet.
>
> Again, I don't think that having a specific instance of blubberoid in the integration project makes sense if the overarching goal is to achieve consistent image builds across environments. A single deployment makes more sense to me.

There is, however, a third overarching use case/requirement: allowing developers to use the tool even when they have no or flaky internet. There is more than one story of a hackathon (or parts of one, like the quiet room) having no or flaky internet, and that's where a lot of work happens. That use case is not satisfied in any way by exposing any kind of endpoint or service, but rather by making blubber available to developers locally.

To complicate things further, the version of blubber developers have installed will almost invariably be out of date if the tooling created:

  • does not encourage keeping blubber updated, and
  • is built around the blubber-as-a-service premise,

creating all the chaos you want to avoid by having blubber as a service. That is one of the reasons I am hesitant about this.

The other is that relying so heavily on blubber as a service would reduce our dogfooding of the tool locally, amplifying the issues already mentioned.

> There is, however, a third overarching use case/requirement: allowing developers to use the tool even when they have no or flaky internet. There is more than one story of a hackathon (or parts of one, like the quiet room) having no or flaky internet, and that's where a lot of work happens. That use case is not satisfied in any way by exposing any kind of endpoint or service, but rather by making blubber available to developers locally.

Docker itself might prove to be more of an impediment to happy hackathons unless folks show up with base images already downloaded (roughly the same problem mediawiki-vagrant has).
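
One mitigation sketch: pre-pull the needed base images while still online (the image name below is purely illustrative):

  # download base images ahead of time so offline builds can still run
  docker pull docker-registry.wikimedia.org/wikimedia-stretch:latest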

> To complicate things further, the version of blubber developers have installed will almost invariably be out of date if the tooling created:
>
>   • does not encourage keeping blubber updated, and
>   • is built around the blubber-as-a-service premise,
>
> creating all the chaos you want to avoid by having blubber as a service. That is one of the reasons I am hesitant about this.
>
> The other is that relying so heavily on blubber as a service would reduce our dogfooding of the tool locally, amplifying the issues already mentioned.

One idea for keeping blubber releases up-to-date is to add a job that auto-releases new blubber builds as part of pushing a tag.
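
A sketch of what such a job's core step might do, assuming blubber builds with a plain go build (the artifact naming and the upload step are hypothetical):

  # only act on commits that carry a release tag
  TAG="$(git describe --tags --exact-match 2>/dev/null)" || exit 0
  # build a static linux binary named after the tag
  GOOS=linux GOARCH=amd64 go build -o "blubber-${TAG}-linux-amd64" .
  # ...then publish the artifact to the chosen release location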

I think I'd inject another use case for "why blubberoid" vs. the blubber binary: a blubberoid used for CI is easier to update for folks working on blubber. We have permissions to deploy a webservice, but we don't have permissions to update the Debian package, which slows down releases a bit.

> There is, however, a third overarching use case/requirement: allowing developers to use the tool even when they have no or flaky internet. There is more than one story of a hackathon (or parts of one, like the quiet room) having no or flaky internet, and that's where a lot of work happens. That use case is not satisfied in any way by exposing any kind of endpoint or service, but rather by making blubber available to developers locally.
>
> Docker itself might prove to be more of an impediment to happy hackathons unless folks show up with base images already downloaded (roughly the same problem mediawiki-vagrant has).

Yes, but there is a difference between being prepared (having all the required tooling installed and ready) and having to reach out to a service frequently as part of the development process. That also makes it different from, e.g., Gerrit, which is also essential, but uploading changes to it can be stalled for quite some time.

Overall, I think my point is that the tooling should not assume such a service is up and running and rely on it.

> To complicate things further, the version of blubber developers have installed will almost invariably be out of date if the tooling created:
>
>   • does not encourage keeping blubber updated, and
>   • is built around the blubber-as-a-service premise,
>
> creating all the chaos you want to avoid by having blubber as a service. That is one of the reasons I am hesitant about this.
>
> The other is that relying so heavily on blubber as a service would reduce our dogfooding of the tool locally, amplifying the issues already mentioned.
>
> One idea for keeping blubber releases up-to-date is to add a job that auto-releases new blubber builds as part of pushing a tag.
>
> I think I'd inject another use case for "why blubberoid" vs. the blubber binary: a blubberoid used for CI is easier to update for folks working on blubber. We have permissions to deploy a webservice, but we don't have permissions to update the Debian package, which slows down releases a bit.

That's true indeed.

thcipriani triaged this task as Normal priority. Dec 20 2018, 7:52 PM
thcipriani moved this task from Backlog to CI on the Release Pipeline board.

Discussed in today's deployment pipeline meeting.

The conclusion was that we would like to open this service up via a public DNS address to be used by CI, but development tooling should rely on blubber being usable locally (i.e., encourage the use of the blubber releases for local development as much as possible).
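
For local development, that flow might look like this sketch (blubber's CLI takes a config file and a variant name; the config path and variant here are hypothetical):

  # render a Dockerfile entirely offline with a locally installed blubber
  blubber .pipeline/blubber.yaml test > Dockerfile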

akosiaris renamed this task from Allow access to blubberoid.discovery.wmnet:8748 to Expose blubberoid to the public allowing CI in WMCS to be able to reach out as well to it. Jan 22 2019, 2:06 PM

Change 485814 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Introduce blubberoid.wikimedia.org

https://gerrit.wikimedia.org/r/485814

Change 485823 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Introduce blubberoid.wikimedia.org in varnish

https://gerrit.wikimedia.org/r/485823

hashar removed a subscriber: hashar. Jan 25 2019, 1:56 PM

Change 485814 merged by Alexandros Kosiaris:
[operations/dns@master] Introduce blubberoid.wikimedia.org

https://gerrit.wikimedia.org/r/485814

Change 485823 merged by Alexandros Kosiaris:
[operations/puppet@production] Introduce blubberoid.wikimedia.org in varnish

https://gerrit.wikimedia.org/r/485823

akosiaris closed this task as Resolved. Feb 6 2019, 10:11 AM
akosiaris claimed this task.
curl -s https://blubberoid.wikimedia.org/?spec |head -5
---
openapi: '3.0.0'
info:
  title: Blubberoid
  description: >

So I guess this is now resolved. Feel free to reopen.
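
For anyone verifying from a WMCS or CI host, the original connectivity check can be re-run against the new public endpoint (assuming the service is served over HTTPS on port 443 behind the caching layer):

  nc -zv blubberoid.wikimedia.org -w 3 443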