Service implementation for wdqs20[13-22]
Closed, Resolved | Public | 5 Estimated Story Points

Description

New servers have been racked as requested in T326689. We still need to configure them and add them to rotation.

AC:

  • wdqs20[13-22] are configured and data is loaded and they are serving traffic
  • ticket to decommission old servers is created

Event Timeline

Current state: wdqs2019 and wdqs2020 are production-ready. The other hosts still need a data transfer and/or a scap deploy before they are complete. (More details about puppet/scap issues here.)

The command below checks the size of the deployment directory. A size smaller than 471M means git-fat isn't working on that host, and the entire contents of /srv/deployment/wdqs need to be deleted. Re-deploying via scap afterwards should make the host ready for production.

sudo cumin wdqs20[14-22].codfw.wmnet 'du -hcxs /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756'
===== NODE GROUP =====
(2) wdqs[2019-2020].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
471M    /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
471M    total
===== NODE GROUP =====
(7) wdqs[2014-2018,2021-2022].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
132M    /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
132M    total
================
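
The size check above can also be scripted when the host list is long. The following is a minimal sketch, assuming the grouped cumin output has been saved as text in the format shown above. The function names, the sample string, and the use of 471M as the expected size are illustrative assumptions rather than part of any existing tooling. Sizes from `du -h` are rounded, so the comparison is approximate.

```python
import re

# Size reported on hosts where git-fat worked (taken from the output above).
EXPECTED_MIB = 471


def parse_size_mib(text):
    """Convert a `du -h` size such as '471M' or '1.2G' to MiB."""
    units = {"": 1 / (1024 * 1024), "K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}
    match = re.fullmatch(r"([\d.]+)([KMGT]?)", text)
    if not match:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * units[match.group(2)]


def groups_needing_redeploy(cumin_output, expected_mib=EXPECTED_MIB):
    """Return node-group expressions whose 'total' line is below expected_mib."""
    flagged = []
    group = None
    for line in cumin_output.splitlines():
        # Group headers look like: (7) wdqs[2014-2018,2021-2022].codfw.wmnet
        header = re.match(r"\(\d+\)\s+(\S+)", line)
        if header:
            group = header.group(1)
            continue
        fields = line.split()
        # The summary line printed by `du -c` looks like: 132M    total
        if group and len(fields) == 2 and fields[1] == "total":
            if parse_size_mib(fields[0]) < expected_mib:
                flagged.append(group)
            group = None
    return flagged


SAMPLE = """\
===== NODE GROUP =====
(2) wdqs[2019-2020].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
471M    /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
471M    total
===== NODE GROUP =====
(7) wdqs[2014-2018,2021-2022].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
132M    /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
132M    total
================
"""

if __name__ == "__main__":
    for group in groups_needing_redeploy(SAMPLE):
        print(group)
```

Run against the sample, this flags only the wdqs[2014-2018,2021-2022] group, matching the manual reading of the output.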

Update: wdqs[2017-2021].codfw.wmnet are now production-ready:

===== NODE GROUP =====
(4) wdqs[2014-2016,2022].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
132M    /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
132M    total
===== NODE GROUP =====
(5) wdqs[2017-2021].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
471M    /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
471M    total
================

I pooled wdqs2020 for a few minutes earlier today, then depooled it. I think it's better to come back on Monday, when we have a chance to look at it more closely.

Update: I forgot to target wdqs2013 in my last command. Here is the latest list of hosts that still need a data transfer and a deploy:

(4) wdqs[2013-2016].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
132M    /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
132M    total

Change 937535 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] wdqs.data-transfer: Keep downtime

https://gerrit.wikimedia.org/r/937535

Change 937572 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: disable alerts for new hosts

https://gerrit.wikimedia.org/r/937572

Change 937572 merged by Ryan Kemper:

[operations/puppet@production] wdqs: disable alerts for new hosts

https://gerrit.wikimedia.org/r/937572

Update: wdqs2016.codfw.wmnet is the last host that needs to be configured for production.

wdqs2020.codfw.wmnet has been receiving production traffic for a week now, with no observed issues.

We should be able to finish the rest pretty soon and start decommissioning the older hosts.

Change 940180 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: fix missing entry in site.pp

https://gerrit.wikimedia.org/r/940180

Change 940180 merged by Bking:

[operations/puppet@production] wdqs: fix missing entry in site.pp

https://gerrit.wikimedia.org/r/940180

Change 940240 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: re-enable alerting on now-in-svc hosts

https://gerrit.wikimedia.org/r/940240

Change 940240 merged by Ryan Kemper:

[operations/puppet@production] wdqs: re-enable alerting on now-in-svc hosts

https://gerrit.wikimedia.org/r/940240

All of these hosts except wdqs202[1-2] are in service. Those last two hosts will be brought into service after a final data transfer (ongoing).

Change 940272 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: re-enable alerting on last 2 new hosts

https://gerrit.wikimedia.org/r/940272

Change 940272 merged by Ryan Kemper:

[operations/puppet@production] wdqs: re-enable alerting on last 2 new hosts

https://gerrit.wikimedia.org/r/940272

wdqs202[1-2] have been brought into service. With the merging of https://gerrit.wikimedia.org/r/c/operations/puppet/+/940272, all hosts are now in service and have alerting enabled.

This ticket's done. Next up: decommissioning the old hosts per T342035.

Gehel triaged this task as High priority.
Gehel moved this task from Needs Reporting to Done on the Data-Platform-SRE board.
RKemper renamed this task from "Configure new WDQS servers in codfw (wdqs20[13-22])" to "Service implementation for wdqs20[13-22]". (Dec 8 2023, 7:21 PM)