Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster
Open, Needs Triage (Public)

Description

Ubuntu Trusty is gone (at least, from our project) and Debian Jessie instance creation just got disabled (see T218119: Disable jessie VM creation in VPS).
Therefore the following instances are not reproducible in their current state. If they get lost to a hardware failure and are not able to be set up on stretch, the service they ran may be SOL.
So it's time to begin migrating our 34 Jessie instances towards Stretch.
You'll notice a Buster prerelease image is available to deployment-prep alongside Stretch. Please don't use this unless production is running the same service on buster, or you are setting up a fresh service that will be on buster when deployed to production, or you are working on migrating the production service to buster.
Update: Buster is now released and available publicly, so the restriction above no longer applies; go nuts.
The following deployment-prep instances are running Jessie:

| Name | Status/task |
| --- | --- |
| deployment-memc04.deployment-prep.eqiad.wmflabs | WIP, Buster plans for memc in T213089: Upgrade memcached for Debian Stretch/Buster |
| deployment-memc05.deployment-prep.eqiad.wmflabs | WIP, Buster plans for memc in T213089: Upgrade memcached for Debian Stretch/Buster |
| deployment-memc06.deployment-prep.eqiad.wmflabs | WIP, Buster plans for memc in T213089: Upgrade memcached for Debian Stretch/Buster |
| deployment-memc07.deployment-prep.eqiad.wmflabs | WIP, Buster plans for memc in T213089: Upgrade memcached for Debian Stretch/Buster |
| deployment-changeprop.deployment-prep.eqiad.wmflabs | Appears at T198901, depending on timing it may disappear before jessie becomes EOL |
| deployment-cpjobqueue.deployment-prep.eqiad.wmflabs | Appears at T198901, depending on timing it may disappear before jessie becomes EOL |
| deployment-mcs01.deployment-prep.eqiad.wmflabs | Appears at T198901, depending on timing it may disappear before jessie becomes EOL |
| deployment-restbase01.deployment-prep.eqiad.wmflabs | Appears at T198901, depending on timing it may disappear before jessie becomes EOL |
| deployment-restbase02.deployment-prep.eqiad.wmflabs | Appears at T198901, depending on timing it may disappear before jessie becomes EOL |
| deployment-sca01.deployment-prep.eqiad.wmflabs | Runs eventstreams, graphoid, recommendation-api, and apertium, which all appear at T198901, depending on timing they may disappear before jessie becomes EOL. |
| deployment-sca02.deployment-prep.eqiad.wmflabs | Runs eventstreams, graphoid, recommendation-api, and apertium, which all appear at T198901, depending on timing they may disappear before jessie becomes EOL. |
| deployment-sca04.deployment-prep.eqiad.wmflabs | Appears to just run recommendation-api which appears at T198901, depending on timing it may disappear before jessie becomes EOL |
| deployment-imagescaler01.deployment-prep.eqiad.wmflabs | 02 and 03 are stretch, plus T216815: Upgrade Thumbor to Buster - is 01 needed? looks like it serves as a memc host for the other imagescalers, can we give that responsibility to one of the others? |
| deployment-etcd-01.deployment-prep.eqiad.wmflabs | prod etcd* hosts are jessie - T224549: Track remaining jessie systems in production - some work might be done in T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster though our one is not for k8s |
| deployment-fluorine02.deployment-prep.eqiad.wmflabs | prod mwlog* hosts are jessie - T224565: Migrate mwlog/udp2log servers to Buster |
| deployment-ircd.deployment-prep.eqiad.wmflabs | prod host kraz is jessie - T224579: Migrate irc.wikimedia.org/kraz to Stretch/Buster |
| deployment-sentry01.deployment-prep.eqiad.wmflabs | spoke to tgr, SRE might be making Sentry inside k8s later this year, in the mean time this is currently unused but people may want to use it to test frontend logic - probably no point migrating it to stretch though |
| deployment-logstash2.deployment-prep.eqiad.wmflabs | WIP, see subtask |

Event Timeline

Krenair updated the task description. Apr 9 2019, 8:16 PM
Krenair updated the task description. Apr 11 2019, 4:34 PM
Krenair updated the task description. Apr 12 2019, 5:11 AM
Krenair updated the task description. Apr 12 2019, 3:39 PM
Krenair updated the task description. Apr 13 2019, 12:08 AM
Krenair updated the task description. Apr 13 2019, 11:53 PM
Krenair updated the task description. Apr 19 2019, 3:13 AM
Krenair updated the task description. Apr 23 2019, 9:31 AM

Mentioned in SAL (#wikimedia-releng) [2019-04-23T14:41:29Z] <Krenair> Shut down deployment-ms-be03 and deployment-ms-be04 T218729

Krenair updated the task description. Apr 23 2019, 2:42 PM
Krenair updated the task description. Apr 23 2019, 2:45 PM
Krenair updated the task description. Apr 25 2019, 3:53 AM

Mentioned in SAL (#wikimedia-releng) [2019-04-25T16:35:35Z] <Krenair> shutting down deployment-ms-fe02 and deployment-poolcounter04 T218729

Krenair updated the task description. Apr 25 2019, 4:37 PM

Mentioned in SAL (#wikimedia-releng) [2019-04-26T13:33:37Z] <Krenair> shut off deployment-conf03 after discussion with otto.mata and elu.key - it seems ancient, broken, unused. T218729

Krenair updated the task description. Apr 26 2019, 1:34 PM
Krenair updated the task description. Apr 27 2019, 4:41 PM
Krenair updated the task description. Apr 27 2019, 5:58 PM
Krenair updated the task description. Apr 27 2019, 6:13 PM

re: logstash, prod hosts are stretch so starting up a stretch instance with the same roles/hiera is expected to work. There will be a couple of migrations involved, namely moving kafka and elasticsearch off to the stretch instance, cc @herron @colewhite

On Thursday when I get rid of deployment-ms-fe02 and deployment-poolcounter04 we should have enough room in the quota again to create an xlarge, at which point I can make a new logstash instance with the same roles etc.
I don't know about doing the migrations though, I'm not familiar with kafka or elasticsearch.

Krenair updated the task description. May 2 2019, 4:58 PM
Krenair added a comment. (Edited) May 2 2019, 5:26 PM

I've set up the instance as deployment-logstash03 - looks like the elastalert package is missing though. On the existing instance it appears to just be installed locally:

krenair@deployment-logstash2:~$ apt-cache policy elastalert
elastalert:
  Installed: 0.1.39-1~bpo9+1
  Candidate: 0.1.39-1~bpo9+1
  Version table:
 *** 0.1.39-1~bpo9+1 0
        100 /var/lib/dpkg/status
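
The reading of the output above (the only source listed for the version is `/var/lib/dpkg/status`, so no configured repository offers the package) can be sketched as a small check. This is a minimal illustration, assuming the `apt-cache policy` layout shown in this comment; the function name `only_in_dpkg_status` and the sample strings are hypothetical and are not part of any tooling mentioned in this task.

```python
import re

# Matches version-table source lines such as
#   "        100 /var/lib/dpkg/status"
#   "        500 http://deb.debian.org/debian stretch/main amd64 Packages"
# Assumption: a source line is indentation, an integer priority, then the source.
SOURCE_LINE = re.compile(r"^\s+(-?\d+)\s+(\S+)")


def only_in_dpkg_status(policy_output: str) -> bool:
    """Return True if every source in the version table is the local dpkg status file.

    True suggests the package was installed by hand (for example with dpkg -i)
    and is not available from any configured apt repository.
    """
    sources = []
    for line in policy_output.splitlines():
        match = SOURCE_LINE.match(line)
        if match:
            sources.append(match.group(2))
    # No sources at all means there is nothing to conclude, so report False.
    return bool(sources) and all(s == "/var/lib/dpkg/status" for s in sources)


if __name__ == "__main__":
    # Output copied from the comment above.
    sample = (
        "elastalert:\n"
        "  Installed: 0.1.39-1~bpo9+1\n"
        "  Candidate: 0.1.39-1~bpo9+1\n"
        "  Version table:\n"
        " *** 0.1.39-1~bpo9+1 0\n"
        "        100 /var/lib/dpkg/status\n"
    )
    print(only_in_dpkg_status(sample))
```

On a host where a repository also carries the package, the version table would list a second source line with a repository URL, and the check would return False.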
Krenair updated the task description. May 2 2019, 5:26 PM

Indeed, that's my bad! I've fixed https://gerrit.wikimedia.org/r/c/operations/puppet/+/502773 to include component/elastalert, which is where the package (for stretch) lives. However, at the moment I can't rebase cleanly on deployment-puppetmaster03 due to conflicts in unrelated patches. Anyway, once 502773 and 505762 are cherry-picked again on the puppetmaster, puppet should work as intended.

No worries, I'm happy to take care of that sort of problem. I've got the puppet repo sorted out (though it's a bit of a mess, and the commit involved doesn't make quite as much sense as it used to; I should return to sort that out later). I've updated those cherry-picks and encountered a couple of issues. The first is a simple missing group, which I've left a comment about on the Gerrit changeset and amended our cherry-pick to fix. The other is around elastalert.util.EAException: Invalid Rule file: /etc/elastalert/security/rules/base.inc20190503-1970-1rnja7p

Krenair updated the task description. May 3 2019, 3:57 PM
Krenair updated the task description. May 3 2019, 4:09 PM
Krenair updated the task description. May 3 2019, 4:57 PM

Thanks! Yes, the validation should be disabled for base.inc. I did so in the last patch set of https://gerrit.wikimedia.org/r/c/operations/puppet/+/505762 but won't have time today to rebase/cherry-pick/test. Let me know if you still run into problems!

No rebase problems today. I just had to fix Bool -> Boolean and set validate_cmd to undef instead of '' in the commit, then re-cherry-pick it. Puppet is happy on the new logstash03 instance now.

Krenair updated the task description. May 7 2019, 3:39 PM

Thank you! Re: migration of kafka and elasticsearch, I and/or @colewhite and/or @herron would be able to assist.

Krenair updated the task description. May 11 2019, 6:19 PM
Krenair updated the task description. May 14 2019, 11:56 AM
Krenair updated the task description. May 15 2019, 7:08 PM
Krenair updated the task description. May 15 2019, 7:13 PM
Krenair updated the task description. May 15 2019, 7:20 PM
Krenair updated the task description. May 29 2019, 10:33 AM
Krenair updated the task description. May 29 2019, 12:32 PM
Krenair updated the task description. May 29 2019, 1:54 PM
Krenair updated the task description. May 29 2019, 4:26 PM
Krenair renamed this task from "Migrate away from Debian Jessie to Debian Stretch" to "Migrate away from Debian Jessie to Debian Stretch/Buster". Jul 9 2019, 10:57 PM
Krenair updated the task description.
Krenair renamed this task from "Migrate away from Debian Jessie to Debian Stretch/Buster" to "Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster". Sep 12 2019, 11:28 PM
hashar removed a subscriber: hashar. Oct 28 2019, 10:38 AM

@fgiunchedi Do we need to do anything else to get rid of deployment-logstash2 and use deployment-logstash03 instead? logstash2 now has a puppet error due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/522406 (T198092)

If deployment-logstash03 has the same classes applied as deployment-logstash2 and no puppet errors, I'd say the next step would be to switch producers to use deployment-logstash03 and point the logstash-beta.wmflabs.org proxy at it. It might help with T233134: logstash-beta.wmflabs.org does not receive any mediawiki events too.

(I've removed most subscribers that came from the merger of T236575 to avoid spamming half of the Wikimedia technical community. In the meantime, the logstash discussion moved to T238707 and I need to follow up.)
Note from https://wikitech.wikimedia.org/wiki/News/Jessie_deprecation#Cloud_VPS_projects:

In December 2019, deadline. Evaluate if Jessie VMs not migrated are actually in use and why they weren't migrated.
In January 2020, shutdown all Jessie instances (unless special arrangements have been made for extension of deadline).

FYI @bd808, the most common reason some of these haven't been migrated is probably that I overestimated the number of services that would be running inside containers in production by this time. Other reasons include things like the production equivalent hosts not yet having been upgraded.
For deployment-prep I would like to request an exemption up to the equivalent deadline for prod (end of security support probably?)

Krenair updated the task description. Dec 14 2019, 11:34 PM
Krenair updated the task description. Dec 14 2019, 11:40 PM
Krenair updated the task description. Dec 14 2019, 11:43 PM
Krenair updated the task description. Dec 15 2019, 12:00 AM
Krenair updated the task description. Dec 15 2019, 12:05 AM

Ack. I assumed that deployment-prep would be among the last projects to fully remove Jessie. Thanks for all your hard work coordinating things here @Krenair.

Krenair updated the task description. Sun, Mar 8, 9:44 PM
Krenair updated the task description. Sun, Mar 8, 9:48 PM