New servers have been racked as requested in T326689. We still need to configure them and add them to rotation.
AC:
- wdqs20[13-22] are configured, have data loaded, and are serving traffic
- ticket to decommission old servers is created
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | bking | T332314 Service implementation for wdqs20[13-22]
Resolved | | bking | T331300 Ensure WDQS stack works on Bullseye
Resolved | | bking | T336540 Ensure prometheus-blazegraph-exporter-wdqs-* services can start in Bullseye or later
Resolved | | bking | T336443 Investigate performance differences between wdqs2022 and older hosts
Resolved | | VRiley-WMF | T358727 Reclaim recently-decommed CP host for WDQS (see T352253)
Resolved | | bking | T340793 Implement depool (source only) and keep-downtime options on data-transfer cookbook
Resolved | | Gehel | T342060 Investigate WDQS categories update failures on Bullseye hosts
In Progress | | Sandeeps | T342162 "scap deploy"'s config-deploy should check for broken symlinks
Resolved | | bking | T330714 Document SRE steps for deploying a new WDQS (and WCQS) host
Merged patch (had wrong ticket in commit message): https://gerrit.wikimedia.org/r/c/operations/puppet/+/934403
Current state: wdqs2019 and wdqs2020 are production-ready. The others need a data transfer and/or scap deploy to be complete. (More details about puppet/scap issues here.)
The command below checks the deployment directory size. If the directory size is smaller than 471M, that means git-fat isn't working and the host needs the entire contents of /srv/deployment/wdqs to be deleted. After that, re-deploying via scap should make the host ready for production.
sudo cumin wdqs20[14-22].codfw.wmnet 'du -hcxs /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756'
===== NODE GROUP =====
(2) wdqs[2019-2020].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
471M /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
471M total
===== NODE GROUP =====
(7) wdqs[2014-2018,2021-2022].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
132M /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
132M total
================
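The size check described above could also be scripted. This is a minimal Python sketch, assuming the cumin output is plain text shaped like the paste in this ticket (real output may carry colour codes or extra summary lines). The helper names `parse_du_size`, `needs_redeploy` and `hosts_needing_redeploy` are hypothetical and not part of cumin or scap. The 471M threshold is the size reported by the hosts where git-fat worked.

```python
import re

# Size (in MiB) that du reports on hosts where git-fat worked, per this ticket.
EXPECTED_MIB = 471.0

# du -h uses powers of 1024; express every unit in MiB.
_UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}

# One cumin node group: "(N) hosts ... <size> total".
# Assumption: the output looks like the paste in the ticket.
_GROUP_RE = re.compile(
    r"\((\d+)\)\s+(\S+).*?(\d+(?:\.\d+)?[KMGT])\s+total",
    re.DOTALL,
)


def parse_du_size(text: str) -> float:
    """Convert a `du -h` size such as '132M' or '1.2G' to MiB."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT])", text.strip())
    if match is None:
        raise ValueError(f"unrecognised du size: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_TO_MIB[unit]


def needs_redeploy(du_total: str, expected_mib: float = EXPECTED_MIB) -> bool:
    """True when the deployment directory is smaller than expected."""
    return parse_du_size(du_total) < expected_mib


def hosts_needing_redeploy(cumin_output: str) -> list[str]:
    """Return the host expressions of node groups below the threshold."""
    return [
        hosts
        for _count, hosts, total in _GROUP_RE.findall(cumin_output)
        if needs_redeploy(total)
    ]
```

Feeding the pasted output to `hosts_needing_redeploy` returns the host expressions for the groups still reporting 132M. These are the hosts that need /srv/deployment/wdqs cleared and a fresh scap deploy.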
Update: wdqs[2017-2021].codfw.wmnet are now production ready:
===== NODE GROUP =====
(4) wdqs[2014-2016,2022].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
132M /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
132M total
===== NODE GROUP =====
(5) wdqs[2017-2021].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
471M /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
471M total
================
I pooled wdqs2020 for a few minutes earlier today, but I depooled it as I think it's better to come back on Monday when we have a chance to look at it more closely.
Update: I forgot to target wdqs2013 in my last command. Here is the latest list of hosts that need a data transfer and a deploy:
(4) wdqs[2013-2016].codfw.wmnet
----- OUTPUT of 'du -hcxs /srv/de...6756ebe194261756' -----
132M /srv/deployment/wdqs/wdqs-cache/revs/dff41b7f460417eb6155aed96756ebe194261756
132M total
Change 937535 had a related patch set uploaded (by Bking; author: Bking):
[operations/cookbooks@master] wdqs.data-transfer: Keep downtime
Change 937572 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: disable alerts for new hosts
Change 937572 merged by Ryan Kemper:
[operations/puppet@production] wdqs: disable alerts for new hosts
Update: wdqs2016.codfw.wmnet is the last host that needs to be configured for production.
wdqs2020.codfw.wmnet has been receiving production traffic for a week now, with no observed issues.
We should be able to finish the rest pretty soon and start decommissioning the older hosts.
Change 940180 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] wdqs: fix missing entry in site.pp
Change 940180 merged by Bking:
[operations/puppet@production] wdqs: fix missing entry in site.pp
Change 940240 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: re-enable alerting on now-in-svc hosts
Change 940240 merged by Ryan Kemper:
[operations/puppet@production] wdqs: re-enable alerting on now-in-svc hosts
All of these hosts except wdqs202[1-2] are in service. Those last two hosts will be brought in service after a final data xfer (ongoing).
Change 940272 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/puppet@production] wdqs: re-enable alerting on last 2 new hosts
Change 940272 merged by Ryan Kemper:
[operations/puppet@production] wdqs: re-enable alerting on last 2 new hosts
wdqs202[1-2] have been brought into service. With the merging of https://gerrit.wikimedia.org/r/c/operations/puppet/+/940272, all hosts are now in service and have alerting enabled.
This ticket's done. Next up: decom'ing the old hosts per T342035.