- db1250 floating host for m1
- db1251 s1
- db1252 s4
- db1253 s7
- db1254 replaces db1156 in s2
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| | | | Unknown Object (Task) |
| Resolved | | Papaul | T380083 Q2:rack/setup/install db125[0-4] |
| Resolved | | FCeratto-WMF | T385141 Productionize db125[0-4] |
| Resolved | | Marostegui | T388024 Switch m1 master db1164 -> db1250 |
Event Timeline
[06:30:45] marostegui@cumin1002:~$ sudo cumin 'db125[0-4].eqiad.wmnet' 'lvextend -L+1000G /dev/mapper/tank-data ; xfs_growfs /srv ; df -hT /srv'
5 hosts will be targeted:
db[1250-1254].eqiad.wmnet
OK to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit: 5
===== NODE GROUP =====
(5) db[1250-1254].eqiad.wmnet
----- OUTPUT of 'lvextend -L+1000...rv ; df -hT /srv' -----
Size of logical volume tank/data changed from <7.56 TiB (1981022 extents) to 8.53 TiB (2237022 extents).
Logical volume tank/data successfully resized.
meta-data=/dev/mapper/tank-data isize=512 agcount=32, agsize=63392704 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1 bigtime=1 inobtcount=1 nrext64=0
data = bsize=4096 blocks=2028566528, imaxpct=5
= sunit=64 swidth=256 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=64 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
data blocks changed from 2028566528 to 2290710528
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/tank-data xfs 8.6T 61G 8.5T 1% /srv
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████| 100% (5/5) [00:00<00:00, 6.47hosts/s]
FAIL | | 0% (0/5) [00:00<?, ?hosts/s]
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'lvextend -L+1000...rv ; df -hT /srv'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Mentioned in SAL (#wikimedia-operations) [2025-02-03T15:37:56Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Depool db1169.eqiad.wmnet T385141', diff saved to https://phabricator.wikimedia.org/P73093 and previous config saved to /var/cache/conftool/dbconfig/20250203-153755-fceratto.json
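The figures in the lvextend / xfs_growfs output from the cumin run above can be cross-checked with some quick arithmetic. This is only a sketch: the 4 MiB extent size is an assumption (it is the LVM default and is not printed in the output), while the extent counts, block counts and bsize=4096 are copied from the log.

```python
# Cross-check of the cumin output: did LV and filesystem both grow by 1000 GiB?
EXTENT_MIB = 4        # ASSUMPTION: LVM default extent size, not shown in the log
BLOCK_BYTES = 4096    # bsize reported by xfs_growfs

GIB = 2**30
TIB = 2**40

# Values copied from the lvextend and xfs_growfs lines
extents_before, extents_after = 1981022, 2237022
blocks_before, blocks_after = 2028566528, 2290710528

lv_growth_gib = (extents_after - extents_before) * EXTENT_MIB / 1024
fs_growth_gib = (blocks_after - blocks_before) * BLOCK_BYTES / GIB

lv_bytes = extents_after * EXTENT_MIB * 2**20
fs_bytes = blocks_after * BLOCK_BYTES

print(f"LV grew by {lv_growth_gib:.0f} GiB, filesystem grew by {fs_growth_gib:.0f} GiB")
print(f"LV size {lv_bytes / TIB:.2f} TiB, filesystem size {fs_bytes / TIB:.2f} TiB")
print("filesystem fills the LV:", lv_bytes == fs_bytes)
```

Under that extent-size assumption both layers grew by the requested 1000 GiB and the filesystem spans the whole logical volume, which is consistent with the 8.53 TiB reported by lvextend.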
Icinga downtime and Alertmanager silence (ID=bd8dc753-0a36-4dc5-8871-155989536f35) set by fceratto@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: provisioning - T385141
db1169.eqiad.wmnet
Mentioned in SAL (#wikimedia-operations) [2025-02-03T15:40:48Z] <fceratto@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1169.eqiad.wmnet with reason: provisioning - T385141
Icinga downtime and Alertmanager silence (ID=288ca8b7-f64e-4619-b10e-5de3227331f0) set by fceratto@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: provisioning - T385141
db1251.eqiad.wmnet
Mentioned in SAL (#wikimedia-operations) [2025-02-03T15:41:44Z] <fceratto@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1251.eqiad.wmnet with reason: provisioning - T385141
Change #1116828 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):
[operations/puppet@production] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod
Change #1116828 merged by Federico Ceratto:
[operations/puppet@production] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod
Mentioned in SAL (#wikimedia-operations) [2025-02-03T16:37:22Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Add db1251.eqiad.wmnet T385141', diff saved to https://phabricator.wikimedia.org/P73096 and previous config saved to /var/cache/conftool/dbconfig/20250203-163722-fceratto.json
Change #1117501 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):
[operations/puppet@production] db1251.yaml: enable monitoring
Change #1117501 merged by Federico Ceratto:
[operations/puppet@production] db1251.yaml: enable monitoring
Mentioned in SAL (#wikimedia-operations) [2025-02-28T11:43:03Z] <fceratto@cumin1002> DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1252.eqiad.wmnet with reason: preparing - T385141
Change #1123654 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):
[operations/puppet@production] db1252.yaml, instances.yaml, site.pp: Prepare db1252 for prod
Change #1123654 merged by Federico Ceratto:
[operations/puppet@production] db1252.yaml, instances.yaml, site.pp: Prepare db1252 for prod
Start pool of db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed - fceratto@cumin1002
Mentioned in SAL (#wikimedia-operations) [2025-03-03T15:11:04Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Pooling in after cloning to db1252 T385141', diff saved to https://phabricator.wikimedia.org/P73976 and previous config saved to /var/cache/conftool/dbconfig/20250303-151103-fceratto.json
Completed pool of db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed - fceratto@cumin1002
Start pool of db1252 gradually with 4 steps - Cloned db124 to db1252 - fceratto@cumin1002
Change #1124379 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):
[operations/puppet@production] db1253.yaml: prepare for production
Change #1124379 merged by Federico Ceratto:
[operations/puppet@production] db1253.yaml: prepare for production
Change #1124641 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] mariadb: Productionize db1250
Change #1124641 merged by Marostegui:
[operations/puppet@production] mariadb: Productionize db1250
Mentioned in SAL (#wikimedia-operations) [2025-03-05T09:07:04Z] <marostegui> Stop db1217:3321 to clone db1250 T385141
Start pool of db1202 gradually with 4 steps - Cloned db1202 to db1253 - fceratto@cumin1002
Change #1124740 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):
[operations/puppet@production] instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254
Change #1124740 merged by Federico Ceratto:
[operations/puppet@production] instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254
sre.mysql.clone --source db1233.eqiad.wmnet --target db1254.eqiad.wmnet ran and pooled in the host after baking it for 1h.
db1253 is also cloned, but I'd like to test the script again because I've just implemented task updates. It should take a few hours.
Unfortunately there's more work to be done around https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1120605 and its related task https://phabricator.wikimedia.org/T387023 to add additional checks on the Icinga notifications and on the dbctl configuration.
Mentioned in SAL (#wikimedia-operations) [2025-03-10T15:53:32Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Preparing db1253 T385141', diff saved to https://phabricator.wikimedia.org/P74174 and previous config saved to /var/cache/conftool/dbconfig/20250310-155332-fceratto.json
Mentioned in SAL (#wikimedia-operations) [2025-03-11T10:30:15Z] <fceratto@cumin1002> START - Cookbook sre.mysql.pool db1253 gradually with 4 steps - Pool in for T385141
Mentioned in SAL (#wikimedia-operations) [2025-03-11T11:15:58Z] <fceratto@cumin1002> END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1253 gradually with 4 steps - Pool in for T385141
Mentioned in SAL (#wikimedia-operations) [2025-03-11T11:48:36Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Preparing db1254 for T385141', diff saved to https://phabricator.wikimedia.org/P74183 and previous config saved to /var/cache/conftool/dbconfig/20250311-114835-fceratto.json
Mentioned in SAL (#wikimedia-operations) [2025-03-11T11:50:13Z] <fceratto@cumin1002> START - Cookbook sre.mysql.pool db1254 gradually with 4 steps - Pool in for T385141
Mentioned in SAL (#wikimedia-operations) [2025-03-11T12:35:57Z] <fceratto@cumin1002> END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1254 gradually with 4 steps - Pool in for T385141
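For reference, the duration of the two sre.mysql.pool runs above can be derived from the START and END timestamps in the SAL entries. A minimal sketch; the timestamps are copied from the log, and the helper name `elapsed_seconds` is just illustrative.

```python
from datetime import datetime

SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used in the SAL entries

# (START, END) timestamps copied from the SAL entries above
pool_runs = {
    "db1253": ("2025-03-11T10:30:15Z", "2025-03-11T11:15:58Z"),
    "db1254": ("2025-03-11T11:50:13Z", "2025-03-11T12:35:57Z"),
}


def elapsed_seconds(start: str, end: str) -> int:
    """Return the whole seconds between two SAL timestamps."""
    delta = datetime.strptime(end, SAL_FORMAT) - datetime.strptime(start, SAL_FORMAT)
    return int(delta.total_seconds())


durations = {host: elapsed_seconds(*span) for host, span in pool_runs.items()}
for host, seconds in durations.items():
    print(f"{host}: {seconds // 60}m{seconds % 60:02d}s")
```

Both gradual pool-ins with 4 steps took roughly 45 minutes end to end.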
This is not fully completed yet, as the host is still depooled for API. Main traffic is fully repooled, but API traffic isn't.
@FCeratto-WMF ^ please repool this host in API and once done this task can be closed again.
Mentioned in SAL (#wikimedia-operations) [2025-06-20T13:04:24Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Pool in API for db1252 - see T385141', diff saved to https://phabricator.wikimedia.org/P78541 and previous config saved to /var/cache/conftool/dbconfig/20250620-130423-fceratto.json
@Marostegui I added the API section for db1252. Do we have any tooling to check whether weights are set correctly across all hosts in a section? @Ladsgroup perhaps you have a script for this?
We don't have any of that. What you can do is just mimic the weights of other API hosts on that section. And once done you can close this task.
Mentioned in SAL (#wikimedia-operations) [2025-06-25T10:22:26Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Set db1252 weight to 300 - see T385141', diff saved to https://phabricator.wikimedia.org/P78677 and previous config saved to /var/cache/conftool/dbconfig/20250625-102225-fceratto.json
I updated the weights for db1252 to mimic its peers, see https://zarcillo.wikimedia.org/ui/weights
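Since there is no tooling for this, a check along these lines could flag hosts whose weight differs from their peers in a group. This is only a sketch under stated assumptions: the input is a plain host-to-weight dict typed in by hand (it does not read dbctl's actual data format), the peer names and the starting weight of 100 are hypothetical, and only the final weight of 300 for db1252 comes from this task. It also assumes that all hosts in a group are meant to share the same weight, which may not hold everywhere.

```python
from collections import Counter


def find_weight_outliers(weights: dict[str, int]) -> dict[str, int]:
    """Return hosts whose weight differs from the most common weight in the group."""
    if not weights:
        return {}
    expected, _count = Counter(weights.values()).most_common(1)[0]
    return {host: w for host, w in weights.items() if w != expected}


# HYPOTHETICAL data: peer names and the initial weight of 100 are made up
api_weights = {"db1252": 100, "peer-a": 300, "peer-b": 300}
outliers = find_weight_outliers(api_weights)
print("before:", outliers)

# After mimicking the peers (300 is the weight set in the dbctl commit above)
api_weights["db1252"] = 300
print("after:", find_weight_outliers(api_weights))
```

Using the most common weight as the reference keeps the check free of hard-coded values, at the cost of being misleading for groups where weights are intentionally uneven.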