setup releases1001.eqiad.wmnet (was: setup mwreleases1001)
Closed, ResolvedPublic

Description

This task will track the provisioning and setup of a ganeti vm mwreleases1001.eqiad.wmnet on the ganeti01 eqiad svc cluster.

Overall request and criteria were discussed on parent task T163743. Since the request was approved, this task will simply track its setup.

Details

Repo               Branch      Lines +/-
operations/puppet  production  +7 -7
operations/puppet  production  +12 -8
operations/puppet  production  +36 -43
operations/puppet  production  +43 -51
operations/puppet  production  +0 -9
operations/dns     master      +3 -1
operations/puppet  production  +33 -8
operations/puppet  production  +84 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -0
operations/puppet  production  +25 -9
operations/puppet  production  +1 -37
operations/puppet  production  +1 -1
operations/puppet  production  +3 -0
operations/puppet  production  +2 -2
operations/puppet  production  +28 -19
operations/puppet  production  +26 -2
operations/dns     master      +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -1
operations/puppet  production  +7 -0
operations/puppet  production  +6 -6
operations/dns     master      +2 -2
operations/puppet  production  +2 -0
operations/puppet  production  +6 -1
operations/dns     master      +3 -0
Event Timeline

OK, this isn't in site.pp yet, since I'm not sure which roles you want to assign. It is ready to be added though, since it's already calling into puppet on its own.

I assume this includes shell access on the machine. This will need an admin group on it for that.

Would the existing "releasers-mediawiki" group make sense? (It is used for upload access to releases.wikimedia.org, members: [catrope, demon, hashar, reedy, thcipriani].) Or is this a different thing that should have a different group of admins?

If needed, this group can then be added via hieradata/hosts/ before a new puppet role has been written.

The ideal group is RelEng + Security. But that list is a good start, we can add anyone who needs it beyond this. (Darian immediately comes to mind)

Change 356425 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add admin group releasers-mediawiki to mwreleases1001

https://gerrit.wikimedia.org/r/356425

Change 356425 merged by Dzahn:
[operations/puppet@production] add admin group releasers-mediawiki to mwreleases1001

https://gerrit.wikimedia.org/r/356425

mwreleasers now have shell access on mwreleases1001

Change 359073 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] rename mwreleases1001 to releases1001

https://gerrit.wikimedia.org/r/359073

We talked about this on IRC and came to the conclusion that we want to use this new VM for not just mediawiki releases but all releases and combine the current role on bromine with jenkins on here. Bromine will stay a host for static websites as originally intended.

So we'll rename this from mwreleases1001 to releases1001 and reinstall.

Change 359077 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] rename mwreleases1001 to releases1001

https://gerrit.wikimedia.org/r/359077

Mentioned in SAL (#wikimedia-operations) [2017-06-14T23:41:15Z] <mutante> mwreleases1001 - scheduled downtime, shutdown, kill VM, re-install as releases1001 (T164030)

Change 359073 merged by Dzahn:
[operations/dns@master] rename mwreleases1001 to releases1001

https://gerrit.wikimedia.org/r/359073

Change 359077 merged by Dzahn:
[operations/puppet@production] rename mwreleases1001 to releases1001

https://gerrit.wikimedia.org/r/359077

Mentioned in SAL (#wikimedia-operations) [2017-06-14T23:49:21Z] <mutante> ganeti: removed instance mwreleases1001, created new instance releases1001 with same parameters (2 VCPUS,4G memory, 1 x 128G disk) (T164030)

Mentioned in SAL (#wikimedia-operations) [2017-06-14T23:55:38Z] <mutante> mwreleases: revoke puppet cert, delete salt key, remove from icinga. releases1001 still syncing disks for a while (50m), being created... T164030

Dzahn renamed this task from setup mwreleases1001.eqiad.wmnet to setup releases1001.eqiad.wmnet (was: setup mwreleases1001).Jun 14 2017, 11:58 PM

Change 359080 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add releases1001.eqiad.wmnet to site.pp

https://gerrit.wikimedia.org/r/359080

Change 359080 merged by Dzahn:
[operations/puppet@production] add releases1001.eqiad.wmnet to site.pp

https://gerrit.wikimedia.org/r/359080

Change 359086 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: update MAC address of releases1001

https://gerrit.wikimedia.org/r/359086

Change 359086 merged by Dzahn:
[operations/puppet@production] install_server: update MAC address of releases1001

https://gerrit.wikimedia.org/r/359086

Change 359087 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: switch releases1001 to stretch

https://gerrit.wikimedia.org/r/359087

Change 359087 merged by Dzahn:
[operations/puppet@production] install_server: switch releases1001 to stretch

https://gerrit.wikimedia.org/r/359087

Reinstalled as releases1001, with stretch. The "releasers-mediawiki" group has shell access (again). Will follow up with a role/profile.

Change 359089 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: add new role/profile, add backups, install jenkins

https://gerrit.wikimedia.org/r/359089

Change 359198 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IPv6 records for releases1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/359198

Change 359198 merged by Dzahn:
[operations/dns@master] add IPv6 records for releases1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/359198

Change 359089 merged by Dzahn:
[operations/puppet@production] releases: add new role/profile, add backups, install jenkins

https://gerrit.wikimedia.org/r/359089

Mentioned in SAL (#wikimedia-operations) [2017-06-15T23:37:18Z] <mutante> added stretch support for jenkins (https://gerrit.wikimedia.org/r/#/c/359227/, https://gerrit.wikimedia.org/r/#/c/359356/) | 'reprepro copy stretch-wikimedia jessie-wikimedia jenkins' to make .deb available on stretch | releases1001 now running jenkins , icinga recovered | (hashar) (T164030)

"releases: Duplicate most of the microsite setup on new host releases1001" (Chad) - https://gerrit.wikimedia.org/r/#/c/361803/

"releases1001: Introduce reprepo profile based on microsite" (Chad) - https://gerrit.wikimedia.org/r/#/c/361804/

Change 363105 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] rsync/releases: add dest_host parameter, only include what's needed

https://gerrit.wikimedia.org/r/363105

Change 363105 merged by Dzahn:
[operations/puppet@production] rsync/releases: add dest_host parameter, only include what's needed

https://gerrit.wikimedia.org/r/363105

Change 363106 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] rsync::quickdatacopy: fix commandline for cron, don't need file_path

https://gerrit.wikimedia.org/r/363106

Change 363106 merged by Dzahn:
[operations/puppet@production] rsync::quickdatacopy: fix commandline for cron, don't need file_path

https://gerrit.wikimedia.org/r/363106

release files are now auto-rsynced:

[releases1001:~] $ sudo crontab -l  | grep releases
*/10 * * * * /usr/local/sbin/srv-org-wikimedia-releases >/dev/null 2>&1

[releases1001:~] $  cat /usr/local/sbin/srv-org-wikimedia-releases 
#!/bin/sh

/usr/bin/rsync -a rsync://bromine.eqiad.wmnet/srv-org-wikimedia-releases /srv/org/wikimedia/releases/



[releases1001:~] $ sudo /usr/bin/rsync -av rsync://bromine.eqiad.wmnet/srv-org-wikimedia-releases /srv/org/wikimedia/releases/
receiving incremental file list

sent 106 bytes  received 43,088 bytes  86,388.00 bytes/sec
total size is 7,034,239,241  speedup is 162,852.23

[bromine:~] $ sudo iptables -L | grep rsync
ACCEPT     tcp  --  releases1001.eqiad.wmnet  anywhere             tcp dpt:rsync

[bromine:~] $ ps aux | grep rsync
root      9819  0.0  0.2  14632  2196 ?        Ss   00:01   0:00 /usr/bin/rsync --daemon --no-detach

However, ownership of the debian directory is wrong on releases1001. It should be owned by reprepro:reprepro, but each rsync run breaks it again.

4.0K drwxr-xr-x  4    13927 prometheus-node-exporter 4.0K Nov  1  2016 debian

Mentioned in SAL (#wikimedia-operations) [2017-07-04T01:30:26Z] <mutante> releases1001: switching GID of reprepro and prometheus-node-exporter group (1000 vs 1001), changing reprepro UID to 13927. using find -exec to fix all the permissions and make it identical to bromine. prevent permissions snafu when rsyncing (T164030)

Permission issue fixed (see the log entry above). The listing now looks like this, just like on bromine, and stays that way after a sync:

4.0K drwxr-xr-x  4 reprepro reprepro            4.0K Nov  1  2016 debian
4.0K drwxrwsr-x 30 root     releasers-mediawiki 4.0K May 31 17:58 mediawiki
4.0K drwxrwsr-x  6 brion    releasers-mobile    4.0K Apr 30  2014 mobile
4.0K -r--r--r--  1 www-data www-data             164 Jan 28  2016 releases-header.html
4.0K drwxr-xr-x  2 catrope  releasers-mediawiki 4.0K Jun 30  2014 VisualEditor
4.0K drwxrwsr-x  2 root     releasers-mediawiki 4.0K Aug 31  2016 wikidiff2
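The ownership drift described above could be caught with a small check after each sync. This is a minimal sketch, not what was deployed: the function name `owner_matches` is hypothetical, and it assumes GNU `stat` (coreutils) as shipped on Debian. The path and expected owner in the usage comment come from the listing above.

```shell
#!/bin/sh
# Return 0 if the given path is owned by the expected "user:group" pair,
# 1 otherwise. Relies on GNU stat, which prints UNKNOWN for IDs that do
# not resolve to a local account name.
owner_matches() {
    path=$1
    expected=$2   # e.g. "reprepro:reprepro"
    actual=$(stat -c '%U:%G' "$path") || return 1
    [ "$actual" = "$expected" ]
}

# Hypothetical usage after an rsync run:
#   owner_matches /srv/org/wikimedia/releases/debian reprepro:reprepro \
#       || echo "ownership of debian/ drifted after rsync" >&2
```

A name-based comparison like this only passes once the numeric UID/GID resolve to the same account names on both hosts, which is what the fix in the log entry arranged.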

@greg I blocked him by saying we need to test uploading first (there is a lot of history of having to debug it on the existing setup, e.g. when subbu uploads parsoid releases), and then I didn't get to it because I was busy with netmon. I said in the ops meeting today that this is what I want to do next. Soon!

Change 368181 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: delete old microsites::releases class

https://gerrit.wikimedia.org/r/368181

Change 368183 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cache::misc: add director for releases(1001)

https://gerrit.wikimedia.org/r/368183

Change 368184 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cache::misc: switch director for releases to releases1001

https://gerrit.wikimedia.org/r/368184

Change 368183 merged by Dzahn:
[operations/puppet@production] cache::misc: add director for releases(1001)

https://gerrit.wikimedia.org/r/368183

Mentioned in SAL (#wikimedia-traffic) [2017-07-27T19:05:27Z] <mutante> added new misc::cache director "releases" for releases* servers, releases moving away from bromine (T164030)

Change 368184 merged by Dzahn:
[operations/puppet@production] cache::misc: switch director for releases to releases1001

https://gerrit.wikimedia.org/r/368184

Mentioned in SAL (#wikimedia-operations) [2017-07-27T19:43:21Z] <mutante> switching https://releases.wikimedia.org backend from bromine to releases1001 - all files have been rsynced (T164030)

Change 368181 merged by Dzahn:
[operations/puppet@production] releases: delete old microsites::releases class

https://gerrit.wikimedia.org/r/368181

17:27 < mutante> !log bromine sudo -E reprepro clearvanished to deleted unused precise-mediawiki causing reprepro errors
17:51 < mutante> !log releases1001 - rsynced reprepro db data from bromine

This made reprepro work, e.g.:

[releases1001:/srv/org/wikimedia/reprepro] $  sudo -E reprepro ls parsoid
parsoid | 0.7.1all | jessie-mediawiki | amd64, i386
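To confirm after a sync that a package is still visible in the expected distribution, the pipe-separated output of `reprepro ls` can be filtered. This is only a sketch: the function name `reprepro_ls_distributions` is hypothetical, and it assumes the four-column layout shown in the output above.

```shell
#!/bin/sh
# Read `reprepro ls` output on stdin and print the distribution column
# (third pipe-separated field), ignoring padding around the separators.
reprepro_ls_distributions() {
    awk -F' *[|] *' 'NF >= 3 { print $3 }'
}

# Hypothetical usage on the host:
#   sudo -E reprepro ls parsoid | reprepro_ls_distributions
```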

Change 368333 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: rsync reprepro data, set active server in Hiera

https://gerrit.wikimedia.org/r/368333

Change 368333 merged by Dzahn:
[operations/puppet@production] releases: rsync reprepro data, set active server in Hiera

https://gerrit.wikimedia.org/r/368333

Change 381365 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases/jenkins: add ProxyPassReverse config line

https://gerrit.wikimedia.org/r/381365

Change 381365 merged by Dzahn:
[operations/puppet@production] releases/jenkins: add ProxyPassReverse config line

https://gerrit.wikimedia.org/r/381365

Mentioned in SAL (#wikimedia-operations) [2017-09-29T00:05:58Z] <mutante> releases1001 - stopped puppet, manually fixing --prefix=/ci setting for jenkins process, killing it, removing init.d file, starting with systemd, jenkins now up T164030

Mentioned in SAL (#wikimedia-releng) [2017-09-29T00:12:00Z] <mutante> jenkins now configured and running at https://releases.wikimedia.org/ci/ (T164030) - but needs additional admin users and puppet is still disabled for temp hack fix
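One way to check which prefix a running Jenkins was actually started with is to read it off the process command line. A minimal sketch, assuming the process carries a `--prefix=` argument as the log entry above describes. The function name and the sample command line in the usage comment are hypothetical.

```shell
#!/bin/sh
# Print the value of the --prefix= argument found in a command line string.
# Prints nothing if the argument is absent.
jenkins_prefix_from_cmdline() {
    printf '%s\n' "$1" | sed -n 's/.*--prefix=\([^ ]*\).*/\1/p'
}

# Hypothetical usage:
#   jenkins_prefix_from_cmdline "java -jar jenkins.war --httpPort=8080 --prefix=/ci"
```

Comparing that value with the prefix the Apache proxy expects makes a /ci versus /jenkins mismatch visible before users hit it.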

Change 381368 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: fix jenkins_prefix, /ci not /jenkins

https://gerrit.wikimedia.org/r/381368

Change 381368 merged by Dzahn:
[operations/puppet@production] releases: fix jenkins_prefix, /ci not /jenkins

https://gerrit.wikimedia.org/r/381368

https://releases.wikimedia.org/ci/ is now usable. I went through the setup wizard. It said it was offline (not surprising), so it couldn't download plugins. I skipped the plugin install, set up a user for myself, then added a user for Chad and dumped the password in his home dir.

Change 381473 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: add missing Jenkins proxy setup

https://gerrit.wikimedia.org/r/381473

Change 381473 merged by Dzahn:
[operations/puppet@production] releases: add missing Jenkins proxy setup

https://gerrit.wikimedia.org/r/381473

Change 381903 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add releases-jenkins to misc-web cluster

https://gerrit.wikimedia.org/r/381903

Change 381907 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add releases-jenkins apache/varnish, move jenkins proxy config

https://gerrit.wikimedia.org/r/381907

Change 381907 merged by Dzahn:
[operations/puppet@production] add releases-jenkins apache/varnish, move jenkins proxy config

https://gerrit.wikimedia.org/r/381907

Change 381903 merged by Dzahn:
[operations/dns@master] add releases-jenkins to misc-web cluster

https://gerrit.wikimedia.org/r/381903

Change 382038 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: drop /ci/ suffix for jenkins-proxy, unify templates

https://gerrit.wikimedia.org/r/382038

Change 382038 merged by Dzahn:
[operations/puppet@production] releases: drop /ci/ suffix for jenkins-proxy, unify templates

https://gerrit.wikimedia.org/r/382038

Change 382097 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases-jenkins: remove now unused jenkins_proxy file

https://gerrit.wikimedia.org/r/382097

Change 382097 merged by Dzahn:
[operations/puppet@production] releases-jenkins: remove now unused jenkins_proxy file

https://gerrit.wikimedia.org/r/382097

I had to revert the last changes since apache was failing to reload (noticed in cronspam) due to a syntax error: the prefix and http-port variables in apache-jenkins.conf.erb were not rendered, because they are not available in the class that uses apache-jenkins.conf.erb to create the apache config. After checking puppet a bit I preferred to just revert, since I didn't have much context :)

Thanks @elukey, that issue was known to me and the fix was pending in Gerrit at https://gerrit.wikimedia.org/r/#/c/382098/. I didn't expect cronspam from it though. The service isn't in use yet.
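A reload failure like this one can be avoided by validating the rendered config before reloading. The following is a generic sketch, not what the puppet code here does: `reload_if_valid` is a hypothetical helper, and the commands in the usage comment assume a Debian host with `apache2ctl` and systemd.

```shell
#!/bin/sh
# Run a validation command; only if it succeeds, run the reload command.
# Returns 1 and leaves the service untouched when validation fails.
reload_if_valid() {
    validate_cmd=$1
    reload_cmd=$2
    if sh -c "$validate_cmd" >/dev/null 2>&1; then
        sh -c "$reload_cmd"
    else
        echo "config check failed, not reloading" >&2
        return 1
    fi
}

# Hypothetical usage:
#   reload_if_valid 'apache2ctl configtest' 'systemctl reload apache2'
```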

Change 382098 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases: rm proxy_jenkins class, mv Apache includes

https://gerrit.wikimedia.org/r/382098

Change 382098 merged by Dzahn:
[operations/puppet@production] releases: rm proxy_jenkins class, mv Apache includes

https://gerrit.wikimedia.org/r/382098

Change 382214 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases-jenkins: fix proxy setup, prefix setting

https://gerrit.wikimedia.org/r/382214

Change 382214 merged by Dzahn:
[operations/puppet@production] releases-jenkins: fix proxy setup, prefix setting

https://gerrit.wikimedia.org/r/382214

Change 382221 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] releases-jenkins: fix prefix for proxy setup, pt.2

https://gerrit.wikimedia.org/r/382221

Change 382221 merged by Dzahn:
[operations/puppet@production] releases-jenkins: fix prefix for proxy setup, pt.2

https://gerrit.wikimedia.org/r/382221

https://releases-jenkins.wikimedia.org now works without the /ci/ prefix and is eqiad-only (while releases.wm.org is still active/active on both). Apache syntax isn't broken anymore, the puppet run is fine, and jenkins says we removed 4 violations.

Back to Chad. Jenkins should be usable now.

The machine itself is up and running with Jenkins as expected, so I'm resolving this task. Other changes are outside its scope.