Productionize db125[0-4]
Closed, Resolved (Public)

Description

  • db1250 floating host for m1
  • db1251 s1
  • db1252 s4
  • db1253 s7
  • db1254 replaces db1156 in s2

Event Timeline

Marostegui changed the task status from Open to Stalled.Jan 30 2025, 8:01 AM
Marostegui changed the task status from Stalled to Open.Jan 31 2025, 6:30 AM
Marostegui moved this task from Blocked to Ready on the DBA board.

Hosts are ready

[06:30:45] marostegui@cumin1002:~$ sudo cumin 'db125[0-4].eqiad.wmnet' 'lvextend -L+1000G /dev/mapper/tank-data ; xfs_growfs /srv ; df -hT /srv'
5 hosts will be targeted:
db[1250-1254].eqiad.wmnet
OK to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit: 5
===== NODE GROUP =====
(5) db[1250-1254].eqiad.wmnet
----- OUTPUT of 'lvextend -L+1000...rv ; df -hT /srv' -----
  Size of logical volume tank/data changed from <7.56 TiB (1981022 extents) to 8.53 TiB (2237022 extents).
  Logical volume tank/data successfully resized.
meta-data=/dev/mapper/tank-data  isize=512    agcount=32, agsize=63392704 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=2028566528, imaxpct=5
         =                       sunit=64     swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 2028566528 to 2290710528
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   8.6T   61G  8.5T   1% /srv
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████| 100% (5/5) [00:00<00:00,  6.47hosts/s]
FAIL |                                                                                                  |   0% (0/5) [00:00<?, ?hosts/s]
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'lvextend -L+1000...rv ; df -hT /srv'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
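As a sanity check on the output above, the extent and block counts can be cross-checked against the requested +1000G. This is only a sketch: the 4 MiB extent size is an assumption (it is the LVM default and is not shown in the log); the other figures are copied from the output.

```python
# Figures copied from the lvextend / xfs_growfs output above.
EXTENTS_BEFORE = 1_981_022
EXTENTS_AFTER = 2_237_022
BLOCKS_BEFORE = 2_028_566_528
BLOCKS_AFTER = 2_290_710_528
XFS_BLOCK_SIZE = 4096  # bsize=4096 in the xfs_growfs output

# Assumption: 4 MiB physical extents (the LVM default); not shown in the log.
EXTENT_SIZE = 4 * 1024**2

GIB = 1024**3
TIB = 1024**4

# Growth of the logical volume and of the filesystem, in bytes.
lv_growth = (EXTENTS_AFTER - EXTENTS_BEFORE) * EXTENT_SIZE
fs_growth = (BLOCKS_AFTER - BLOCKS_BEFORE) * XFS_BLOCK_SIZE

print(f"LV grew by {lv_growth / GIB:.0f} GiB")
print(f"FS grew by {fs_growth / GIB:.0f} GiB")
print(f"LV size after: {EXTENTS_AFTER * EXTENT_SIZE / TIB:.2f} TiB")
print(f"FS size after: {BLOCKS_AFTER * XFS_BLOCK_SIZE / TIB:.2f} TiB")
```

Under that assumption both layers grew by the same amount, and the filesystem fills the whole logical volume after xfs_growfs.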

Mentioned in SAL (#wikimedia-operations) [2025-02-03T15:37:56Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Depool db1169.eqiad.wmnet T385141', diff saved to https://phabricator.wikimedia.org/P73093 and previous config saved to /var/cache/conftool/dbconfig/20250203-153755-fceratto.json

Icinga downtime and Alertmanager silence (ID=bd8dc753-0a36-4dc5-8871-155989536f35) set by fceratto@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: provisioning - T385141

db1169.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-02-03T15:40:48Z] <fceratto@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1169.eqiad.wmnet with reason: provisioning - T385141

Icinga downtime and Alertmanager silence (ID=288ca8b7-f64e-4619-b10e-5de3227331f0) set by fceratto@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: provisioning - T385141

db1251.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-02-03T15:41:44Z] <fceratto@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1251.eqiad.wmnet with reason: provisioning - T385141

Change #1116828 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod

https://gerrit.wikimedia.org/r/1116828

Change #1116828 merged by Federico Ceratto:

[operations/puppet@production] instances.yaml,db1251.yaml,site.pp: Prepare db1251 for prod

https://gerrit.wikimedia.org/r/1116828

Mentioned in SAL (#wikimedia-operations) [2025-02-03T16:37:22Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Add db1251.eqiad.wmnet T385141', diff saved to https://phabricator.wikimedia.org/P73096 and previous config saved to /var/cache/conftool/dbconfig/20250203-163722-fceratto.json

Change #1117501 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] db1251.yaml: enable monitoring

https://gerrit.wikimedia.org/r/1117501

Change #1117501 merged by Federico Ceratto:

[operations/puppet@production] db1251.yaml: enable monitoring

https://gerrit.wikimedia.org/r/1117501

Mentioned in SAL (#wikimedia-operations) [2025-02-28T11:43:03Z] <fceratto@cumin1002> DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1252.eqiad.wmnet with reason: preparing - T385141

Change #1123654 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] db1252.yaml, instances.yaml, site.pp: Prepare db1252 for prod

https://gerrit.wikimedia.org/r/1123654

Change #1123654 merged by Federico Ceratto:

[operations/puppet@production] db1252.yaml, instances.yaml, site.pp: Prepare db1252 for prod

https://gerrit.wikimedia.org/r/1123654

Start pool of db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed - fceratto@cumin1002

Mentioned in SAL (#wikimedia-operations) [2025-03-03T15:11:04Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Pooling in after cloning to db1252 T385141', diff saved to https://phabricator.wikimedia.org/P73976 and previous config saved to /var/cache/conftool/dbconfig/20250303-151103-fceratto.json

Completed pool of db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed - fceratto@cumin1002

Start pool of db1252 gradually with 4 steps - Cloned db124 to db1252 - fceratto@cumin1002

Change #1124379 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] db1253.yaml: prepare for production

https://gerrit.wikimedia.org/r/1124379

Change #1124379 merged by Federico Ceratto:

[operations/puppet@production] db1253.yaml: prepare for production

https://gerrit.wikimedia.org/r/1124379

Change #1124641 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Productionize db1250

https://gerrit.wikimedia.org/r/1124641

Change #1124641 merged by Marostegui:

[operations/puppet@production] mariadb: Productionize db1250

https://gerrit.wikimedia.org/r/1124641

Mentioned in SAL (#wikimedia-operations) [2025-03-05T09:07:04Z] <marostegui> Stop db1217:3321 to clone db1250 T385141

Start pool of db1202 gradually with 4 steps - Cloned db1202 to db1253 - fceratto@cumin1002

Change #1124740 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254

https://gerrit.wikimedia.org/r/1124740

Change #1124740 merged by Federico Ceratto:

[operations/puppet@production] instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254

https://gerrit.wikimedia.org/r/1124740

sre.mysql.clone --source db1233.eqiad.wmnet --target db1254.eqiad.wmnet ran and pooled in the host after baking it for 1h.

db1253 is also cloned, but I'd like to test the script again as I've just implemented task updates. It should take a few hours.

No problem from my side!

Started cloning db1202.eqiad.wmnet to db1253.eqiad.wmnet - fceratto@cumin1002

Finished cloning db1202.eqiad.wmnet to db1253.eqiad.wmnet - fceratto@cumin1002

FCeratto-WMF updated the task description. (Show Details)

Unfortunately, more work is needed on https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1120605 and its related task https://phabricator.wikimedia.org/T387023 to add checks on the Icinga notifications and on the dbctl configuration.

Mentioned in SAL (#wikimedia-operations) [2025-03-10T15:53:32Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Preparing db1253 T385141', diff saved to https://phabricator.wikimedia.org/P74174 and previous config saved to /var/cache/conftool/dbconfig/20250310-155332-fceratto.json

Mentioned in SAL (#wikimedia-operations) [2025-03-11T10:30:15Z] <fceratto@cumin1002> START - Cookbook sre.mysql.pool db1253 gradually with 4 steps - Pool in for T385141

Mentioned in SAL (#wikimedia-operations) [2025-03-11T11:15:58Z] <fceratto@cumin1002> END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1253 gradually with 4 steps - Pool in for T385141
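To illustrate what pooling "gradually with 4 steps" could look like, here is a minimal sketch that produces a ramp of pooled percentages. The function name `pooling_steps` and the linear schedule are assumptions made for illustration; the actual sre.mysql.pool cookbook may use a different schedule and waits between steps.

```python
def pooling_steps(steps: int) -> list[int]:
    """Return the pooled percentage to apply at each step, ending at 100.

    Hypothetical helper: a plain linear ramp, not the cookbook's own logic.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [round(100 * i / steps) for i in range(1, steps + 1)]


print(pooling_steps(4))   # [25, 50, 75, 100]
print(pooling_steps(10))  # ten increments of 10, as in a "slow" pool-in
```

Each percentage would be applied in turn, with a pause to watch replication lag and error rates before moving on.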

Mentioned in SAL (#wikimedia-operations) [2025-03-11T11:48:36Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Preparing db1254 for T385141', diff saved to https://phabricator.wikimedia.org/P74183 and previous config saved to /var/cache/conftool/dbconfig/20250311-114835-fceratto.json

Mentioned in SAL (#wikimedia-operations) [2025-03-11T11:50:13Z] <fceratto@cumin1002> START - Cookbook sre.mysql.pool db1254 gradually with 4 steps - Pool in for T385141

Mentioned in SAL (#wikimedia-operations) [2025-03-11T12:35:57Z] <fceratto@cumin1002> END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1254 gradually with 4 steps - Pool in for T385141

Closing task hopefully for real this time 😅

Ladsgroup subscribed.

db1252 is not pooled; it's not in dbctl at all (I tried to depool it for T396648 but got an "instance is uninitialized" error).

Start pool of db1252* slowly with 10 steps - Pooling in - fceratto@cumin1002

Completed pool of db1252* slowly with 10 steps - Pooling in - fceratto@cumin1002

This is still not really complete, as the host is still depooled for API. Main traffic is fully repooled, but API traffic isn't.

@FCeratto-WMF ^ please repool this host in API and once done this task can be closed again.

Mentioned in SAL (#wikimedia-operations) [2025-06-20T13:04:24Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Pool in API for db1252 - see T385141', diff saved to https://phabricator.wikimedia.org/P78541 and previous config saved to /var/cache/conftool/dbconfig/20250620-130423-fceratto.json

@Marostegui I added the API section for db1252. Do we have any tooling to check whether weights are set correctly across all hosts in a section? @Ladsgroup perhaps you have a script for this?

We don't have any of that. What you can do is just mimic the weights of other API hosts on that section. And once done you can close this task.
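Since no such tooling exists, a check along these lines could be sketched as follows. Everything here is hypothetical: the function name, the host names, and the idea of treating the most common weight in a group as the reference. Weights may differ on purpose in a real section, so the result is only a list of hosts worth a second look.

```python
from collections import Counter


def find_weight_outliers(weights: dict[str, int]) -> dict[str, int]:
    """Return hosts whose weight differs from the most common weight."""
    if not weights:
        return {}
    common_weight, _ = Counter(weights.values()).most_common(1)[0]
    return {host: w for host, w in weights.items() if w != common_weight}


# Made-up example data, not taken from dbctl.
api_weights = {"db-a": 300, "db-b": 300, "db-c": 100}
print(find_weight_outliers(api_weights))  # {'db-c': 100}
```

The input dictionary would have to be built by hand or from a dbctl config export; this sketch does not read dbctl itself.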

Mentioned in SAL (#wikimedia-operations) [2025-06-25T10:22:26Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Set db1252 weight to 300 - see T385141', diff saved to https://phabricator.wikimedia.org/P78677 and previous config saved to /var/cache/conftool/dbconfig/20250625-102225-fceratto.json

I updated the weights for db1252 to mimic its peers, see https://zarcillo.wikimedia.org/ui/weights