During an audit of HTTPS-related things (cf T132521#2202245), it was noted that gallium.wikimedia.org appears to only host two HTTP sites (doc.wikimedia.org and integration.wikimedia.org), both of which are currently revproxied through the cache_misc cluster. If gallium has no other reason that it needs to be on a public subnet, we should move it to an internal-subnet host to reduce its exposure to the wild Internet.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
cache_misc: change doc/integration.wm.o backend | operations/puppet | production | +2 -2
Status | Assigned | Task
---|---|---
Declined | None | T133150 Move gallium to an internal host?
Declined | None | T137358 Migrate CI services from gallium to contint1001
Resolved | hashar | T137323 Firewall rules for labs support host to communicate with contint1001.wikimedia.org (new gallium)
Resolved | hashar | T137279 Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie
Declined | None | T137293 Update all references to gallium and change it to contint1001 in integration/*
Resolved | None | T137265 / on gallium is read only, breaking jenkins
Resolved | hashar | T137418 Remove zuul-merger from gallium
Resolved | hashar | T133300 Target architecture without gallium.wikimedia.org
Declined | None | T138955 Can scandium.eqiad.wmnet receives a couple 500G hard drive in a RAID 1 array?
Declined | None | T139938 eqiad: 2 300GB SSD disks for scandium.eqiad.wmnet
Resolved | hashar | T139620 Move CI coverage reports out of integration.wikimedia.org to a new domain or doc.wm.o
Event Timeline
It's also running Jenkins. Added Hashar to answer whether it needs the public IP. Also see T95757.
gallium was set up in 2011 and is still on Precise. It received a public IP to serve the Jenkins web interface. Over time, all the HTTP entry points have been migrated behind the misc varnish.
Besides doc/integration.wikimedia.org, the server hosts Jenkins, the Zuul scheduler and the Zuul merger. There are network flows from/to labnodepool1001 and scandium in the labs support network, as well as flows to/from labs instances.
We have a tracking task to get rid of gallium: T95757. There are not many cycles to work on it, but a first step is to update the CI architecture documentation, especially to keep track of all the network flows: T102137. The outdated doc is at https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Isolation
Maybe we can assign a private IP to gallium, then migrate network flows and update firewall rules. Once everything is migrated, phase out the public IP and rename the host to gallium.eqiad.wmnet.
We have created a subproject in Phabricator: https://phabricator.wikimedia.org/project/view/1966/
First step is for Release-Engineering-Team to agree on an architecture via T133300 and propose it to SRE for validation.
I have drawn a summary of the web services that end up on gallium. One is on doc.wikimedia.org; the three others are on integration.wikimedia.org. They all pass through misc varnish, and the path-based routing is done by Apache on gallium via mod_proxy.
None of that needs a public IP for sure.
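To illustrate the path-based routing described above, here is a minimal sketch in Python rather than Apache configuration. The prefixes and backend addresses are hypothetical (this task does not list the actual paths or ports used on gallium), and the first-match ordering is meant to mirror how mod_proxy evaluates `ProxyPass` rules in configuration order.

```python
# Hypothetical prefix -> backend table. The real paths and ports configured
# on gallium are NOT given in this task; these are placeholders.
ROUTES = [
    ("/ci/", "http://127.0.0.1:8080"),    # assumed: Jenkins web interface
    ("/zuul/", "http://127.0.0.1:8001"),  # assumed: Zuul status page
]


def pick_backend(path):
    """Return the backend for the first matching prefix, or None.

    None means the request is not proxied and would be served locally
    (for example, static files from the document root).
    """
    for prefix, backend in ROUTES:
        if path.startswith(prefix):
            return backend
    return None


if __name__ == "__main__":
    for p in ("/ci/job/example/", "/zuul/status.json", "/index.html"):
        print(p, "->", pick_backend(p))
```

Every backend in such a table is a local or internal address, which is why the proxied web services themselves do not require the host to have a public IP.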
As I mentioned earlier on this task, the Gearman daemon is reached by hosts in the labs support network: labnodepool1001.eqiad.wmnet and scandium.eqiad.wmnet. I don't think we can make them reach a private IP in prod, so we need to dispatch gallium services to different hosts/networks. This is going to be discussed on T133300.
With gallium having lost a disk today, we had contint1001.eqiad.wmnet allocated (Jessie and private IP). Switching services to it is T137358.
Change 293284 had a related patch set uploaded (by Hashar):
cache_misc: change doc/integration.wm.o backend
Change 293284 abandoned by Hashar:
cache_misc: change doc/integration.wm.o backend
Reason:
I prepared this patch in case we had to switch the CI infra to contint if gallium proved to be lost.
That is no longer urgent, and we are considering a better long-term plan via T133300.
integration.wikimedia.org (with Zuul and Jenkins) is going to migrate to scandium.eqiad.wmnet.
doc.wikimedia.org is looking for a new home, potentially via T137890 or yet another task.
Following T140257#2595926 and the follow-up response from ops, we are keeping the status quo of using a public IP.