Move deployment-prep redis instances to stretch
Closed, Resolved · Public

Description

As part of porting redis metrics to Prometheus I wanted to test the metrics in deployment-prep. The redis instances currently running are on trusty, so I've provisioned two new redis instances on stretch (deployment-redis0[56]). We'll need to put them into active use and remove the old trusty instances.

Restricted Application removed a project: Patch-For-Review. Oct 31 2017, 11:09 AM

AFAIU this is the procedure to commission new redis instances:

  • add 05 and 06 to redis::shards in https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep (just one redis instance each)
  • set 05 as slave of 01 (ditto for 06 -> 02)
  • change wmf-config/LabsServices.php to point to 05 and 06
  • change wmf-config/jobqueue-labs.php to point to 05
  • change hieradata/labs/deployment-prep/common.yaml jobrunner config to point to 05
  • verify all redis traffic is hitting 05/06 and not 01/02
  • break replication, 05/06 are now masters
  • decom 01/02
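
The replication steps above (make 05/06 slaves of 01/02, then break replication) could be sketched roughly as follows. This is only an illustration, not what was actually run: it builds the `redis-cli` invocations without contacting any server. The port 6379 is an assumption (the Redis default; the task does not state it), the helper names are hypothetical, and any authentication the instances may require is left out.

```python
# Sketch only: builds redis-cli command lines for the replication steps.
# Nothing here connects to a server.
REDIS_PORT = 6379  # assumption: default Redis port, not stated in the task

# new instance -> old instance it should replicate from (pairs from the task)
PAIRS = {
    "deployment-redis05": "deployment-redis01",
    "deployment-redis06": "deployment-redis02",
}


def replicate_commands(pairs, port=REDIS_PORT):
    """Commands that make each new instance a slave of its old counterpart."""
    return [
        ["redis-cli", "-h", new, "-p", str(port), "SLAVEOF", old, str(port)]
        for new, old in pairs.items()
    ]


def promote_commands(pairs, port=REDIS_PORT):
    """Commands that break replication so the new instances become masters."""
    return [
        ["redis-cli", "-h", new, "-p", str(port), "SLAVEOF", "NO", "ONE"]
        for new in pairs
    ]


if __name__ == "__main__":
    # Print the commands in the order the procedure would use them.
    for cmd in replicate_commands(PAIRS) + promote_commands(PAIRS):
        print(" ".join(cmd))
```

Between the two phases, running `INFO replication` on 05/06 and `INFO clients` on 01/02 is one way to check that replication has caught up and that clients have moved off the old instances before promoting.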

cc @Joe @elukey @hashar

Change 387570 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/mediawiki-config@master] labs: use new redis servers for locks

https://gerrit.wikimedia.org/r/387570

Change 386869 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add redis stretch deployment-prep instances

https://gerrit.wikimedia.org/r/386869

Change 387579 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: use deployment-redis05 for labs jobrunner

https://gerrit.wikimedia.org/r/387579

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Nov 6 2017, 2:29 PM

Puppet is failing on deployment-redis01 and deployment-redis02 because the prometheus redis_exporter requires systemd, while these trusty hosts use upstart:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: {"message":"Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, You can only use systemd resources on systems with systemd, got upstart at /etc/puppet/modules/systemd/manifests/init.pp:8:9 at /etc/puppet/modules/prometheus/manifests/redis_exporter.pp:49 on node deployment-redis01.deployment-prep.eqiad.wmflabs","issue_kind":"RUNTIME_ERROR","stacktrace":["Warning: The 'stacktrace' property is deprecated and will be removed in a future version of Puppet. For security reasons, stacktraces are not returned with Puppet HTTP Error responses."]}

IIRC @Pchelolo has finished running their tests in deployment-prep that used redis, so we could move forward with the above patches and move redis to stretch in deployment-prep. Sounds good, Release-Engineering-Team?

mmodell added a subscriber: mmodell. Jan 8 2018, 9:46 PM

@fgiunchedi sounds good to me! Puppet is now broken on the old redis nodes, as @hashar mentioned above.

fgiunchedi moved this task from Doing to Backlog on the User-fgiunchedi board. Feb 5 2018, 3:17 PM
fgiunchedi removed fgiunchedi as the assignee of this task. Apr 6 2018, 8:06 AM

Indeed, I'm removing myself as assignee since I won't have time to work on moving over to the new redis stretch instances in deployment-prep.
For reference these are the related reviews:

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. Apr 6 2018, 8:06 AM

These are two of the four remaining trusty instances in deployment-prep, and they have been continuously failing puppet for months. I wonder whether they still serve any purpose or should just be deleted. If they are meant to stay, who would be responsible for upgrading them and resolving the puppet errors?

Krenair added a comment. May 27 2018, 8:54 PM

@fgiunchedi: So we basically need to find someone to review the puppet patches (or cherry-pick them) and merge the mediawiki-config patch, then shut down the old instances and remove references to them? If so I can probably push this across the finish line.

It looks like @Joe has made and merged patches that essentially obsolete those, and I can't find any remaining references to the old instances. I'm going to shut down the old instances and see if anything breaks.

Mentioned in SAL (#wikimedia-releng) [2018-06-09T02:13:30Z] <Krenair> shut down old deployment-redis01 and deployment-redis02 instances T179371

hashar removed a subscriber: hashar. Jun 9 2018, 5:48 AM

Change 387570 abandoned by Filippo Giunchedi:
labs: use new redis servers for locks

Reason:
As per Alex "Obsoleted by Ia65009dc"

https://gerrit.wikimedia.org/r/387570

Change 386869 abandoned by Filippo Giunchedi:
hieradata: add redis stretch deployment-prep instances

Reason:
Indeed, as per Alex "Obsoleted by I411fcef3"

https://gerrit.wikimedia.org/r/386869

Change 387579 abandoned by Filippo Giunchedi:
hieradata: use deployment-redis05 for labs jobrunner

Reason:
As per Alex "Obsoleted by I411fcef3"

https://gerrit.wikimedia.org/r/387579

Alright. Leaving open pending deletion of the old redis hosts in a few weeks then?

Change 386869 restored by Krinkle:
hieradata: add redis stretch deployment-prep instances

Reason:
Still beta-picked

https://gerrit.wikimedia.org/r/386869

Change 386869 abandoned by Krinkle:
hieradata: add redis stretch deployment-prep instances

https://gerrit.wikimedia.org/r/386869

Mentioned in SAL (#wikimedia-releng) [2018-07-08T16:54:10Z] <Krenair> deleted deployment-redis02 T179371

Mentioned in SAL (#wikimedia-releng) [2018-07-08T16:54:59Z] <Krenair> deleted deployment-redis01 T179371

Krenair closed this task as Resolved. Jul 13 2018, 8:30 PM
Krenair assigned this task to fgiunchedi.