User Details
- User Since
- May 5 2026, 7:27 AM (4 w, 6 d)
- Availability
- Available
- LDAP User
- CWilliams
- MediaWiki User
- CWilliams-WMF
Today
Thu, Jun 4
The remaining hosts are out of scope for this ticket:
% sudo cumin 'A:db-section-x3 and A:bookworm'
6 hosts will be targeted:
clouddb[1016,1020,1022-1023].eqiad.wmnet,db2200.codfw.wmnet,db1216.eqiad.wmnet
DRY-RUN mode enabled, aborting
Proceeding with the reimage of db1255 for T426725
Wed, Jun 3
The remaining hosts are out of scope for this ticket, marking as resolved:
% sudo cumin 'A:db-section-s8 and A:bookworm'
6 hosts will be targeted:
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db2198.codfw.wmnet,db1171.eqiad.wmnet,dbstore1009.eqiad.wmnet
DRY-RUN mode enabled, aborting
@elukey yes, I did have an idea... but @Volans suggested that making it part of the log messages from the calls that add the downtime would be preferable, given that it makes it available to every cookbook, presuming that there are no complications in doing that.
@Marostegui would the log message be enough for you?
Thanks!
Tue, Jun 2
AttributeError: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Another occasion where Icinga stayed red on "MariaDB sustained replica lag on <section>", despite replication having caught up about 10-15 minutes beforehand.
Manually cleared downtime once green and then repooled.
Mon, Jun 1
There is no default downtime; you need to pass --downtime=n for the cookbook to set one. Given that depooling can include decommissioning, it didn't seem to make sense to have a default value.
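A minimal sketch of an optional downtime flag with no default, assuming an argparse-style argument parser. The names `build_parser` and `downtime_duration` are hypothetical, and treating n as a number of hours is an assumption made only for illustration; this is not the actual sre.mysql.depool code.

```python
import argparse
from datetime import timedelta
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where --downtime is optional and has no default."""
    parser = argparse.ArgumentParser(prog="depool-sketch")
    parser.add_argument("host", help="Host to depool")
    parser.add_argument(
        "--downtime",
        type=int,
        default=None,  # no default: downtime is only set when explicitly requested
        metavar="N",
        help="Downtime length (hours assumed here); omit to skip downtiming",
    )
    return parser


def downtime_duration(args: argparse.Namespace) -> Optional[timedelta]:
    """Return the requested downtime, or None when no downtime was asked for."""
    if args.downtime is None:
        return None
    if args.downtime <= 0:
        raise ValueError("--downtime must be a positive integer")
    return timedelta(hours=args.downtime)
```

With this shape, a caller that is decommissioning a host can simply omit the flag and no downtime is scheduled.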
@FCeratto-WMF as mentioned, this is not something unique to pooling. For example, in sre.mysql.clone:
step("icinga", "Disabling monitoring for source and target host")
source_alerter = self.alerting_hosts(self.source_host.hosts)
source_downtime_id = source_alerter.downtime(self.admin_reason, duration=timedelta(hours=8))
@Marostegui no, the spicerack code does not do that, nor does it look like other sre.mysql cookbooks show that information. However, the sre.hosts.downtime cookbook updates Phabricator with a comment. So, if you want it to be logged to the console, Phabricator, or both, then it looks like that is a more general task for sre.mysql; alternatively, the downtime cookbook can be used directly. I will wait to hear your reply before I close this ticket, as the change was merged.
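As a rough illustration of what logging the downtime to the console could look like, here is a sketch using only the standard logging module. The helpers `describe_downtime` and `log_downtime` are hypothetical and are not part of spicerack or any sre.mysql cookbook; the message format is likewise an assumption.

```python
import logging
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence

logger = logging.getLogger("downtime-sketch")


def describe_downtime(
    hosts: Sequence[str],
    reason: str,
    duration: timedelta,
    now: Optional[datetime] = None,
) -> str:
    """Build a human-readable message stating how long the hosts are downtimed."""
    now = now or datetime.now(timezone.utc)
    until = now + duration
    return (
        f"Downtimed {', '.join(hosts)} for {duration} "
        f"until {until:%Y-%m-%d %H:%M} UTC: {reason}"
    )


def log_downtime(hosts: Sequence[str], reason: str, duration: timedelta) -> None:
    """Emit the downtime message at INFO level so it appears in the run output."""
    logger.info(describe_downtime(hosts, reason, duration))
```

Putting the message in one shared helper would make it available to every caller that sets a downtime, rather than repeating it per cookbook.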
Wed, May 27
START - Cookbook sre.mysql.depool depool db2163: Testing cookbook
[cookbooks.sre.mysql.pool.depool] Setting downtime
Scheduling downtime on Icinga server alert1002.wikimedia.org for hosts: db2163
Created silence ID da7a2b6d-a0f4-46d9-a37a-b29ef98b503b
Previous configuration saved. To restore it run: dbctl config restore /var/cache/conftool/dbconfig/20260527-155100-cwilliams.json
dbctl commit (dc=codfw): 'Testing cookbook', diff saved to https://phabricator.wikimedia.org/P93277 and previous config saved to /var/cache/conftool/dbconfig/20260527-155100-cwilliams.json
Monitoring number of wikiuser* connections
Connection drain completed
Unable to access task : not adding comment 'Completed depooling of db2163 by cwilliams@cumin1003: Testing cookbook'
Released lock for key /spicerack/locks/cookbooks/sre.mysql.depool:db2163: {'concurrency': 1, 'created': '2026-05-27 15:50:51.057298', 'owner': 'cwilliams@cumin1003 [1809630]', 'ttl': 60}
END (PASS) - Cookbook sre.mysql.depool (exit_code=0) depool db2163: Testing cookbook
Whilst the original cookbook was waiting for replication to catch up, an exception caused the process to bail out: https://phabricator.wikimedia.org/P93266
Judging by Icinga, it appears that the service recovered at:
[2026-05-27 15:10:44] SERVICE ALERT: db1178;MariaDB sustained replica lag on s8;OK;HARD;5;(C)10 ge (W)5 ge 0
Tue, May 26
@Ladsgroup thanks for the quick response!
but we can easily test this
What is involved in doing this?
Fri, May 22
Thu, May 21
Pending repool whilst reimaging for T426725
Note that the MariaDB minor version differs between the two hosts; the other differences seem to be as expected:
24 config differences
Variable                  db2241                    db2162
========================= ========================= =========================
general_log_file          db2241.log                db2162.log
gtid_binlog_pos           171966580-171966580-79... 171966560-171966560-15...
gtid_binlog_state         171966580-171966580-79... 171966560-171966560-15...
gtid_current_pos          0-180359179-5751637176... 0-180359179-5751637176...
gtid_domain_id            180356619                 180359385
gtid_slave_pos            0-180359179-5751637176... 0-180359179-5751637176...
hostname                  db2241                    db2162
innodb_buffer_pool_size   405874409472              404800667648
innodb_buffer_pool_siz... 405874409472              404800667648
innodb_buffer_pool_siz... 405874409472              404800667648
log_bin_basename          /srv/sqldata/db2241-bin   /srv/sqldata/db2162-bin
log_bin_index             /srv/sqldata/db2241-bi... /srv/sqldata/db2162-bi...
log_slow_query_file       db2241-slow.log           db2162-slow.log
pid_file                  /srv/sqldata/db2241.pid   /srv/sqldata/db2162.pid
report_host               db2241.codfw.wmnet        db2162.codfw.wmnet
rpl_semi_sync_master_e... ON                        OFF
rpl_semi_sync_slave_en... OFF                       ON
server_id                 180356619                 180359385
server_uid                NJgxPfGLXWwZjiC6baS4tu... IhJ+cAGEva1poSuFcK3mtg...
slow_query_log_file       db2241-slow.log           db2162-slow.log
version                   10.11.13-MariaDB-log      10.11.16-MariaDB-log
version_source_revision   8fb09426b98583916ccfd4... 3218602d3100db9ce7a875...
version_ssl_library       OpenSSL 3.0.16 11 Feb ... OpenSSL 3.5.6 7 Apr 2026
wsrep_node_name           db2241                    db2162
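A comparison like the one above can be sketched in a few lines of Python. This is not the tool that produced the output; `diff_variables` is a hypothetical helper, and which variables count as "expected to differ" is an assumption chosen purely for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def diff_variables(
    left: Dict[str, str],
    right: Dict[str, str],
    expected: Iterable[str] = (),
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return variables whose values differ, skipping those expected to differ.

    A variable missing on one side is reported with None for that side.
    """
    skip = set(expected)
    result = {}
    for name in sorted(set(left) | set(right)):
        if name in skip:
            continue
        a, b = left.get(name), right.get(name)
        if a != b:
            result[name] = (a, b)
    return result


# Values copied from a few rows of the comparison above.
db2241 = {
    "hostname": "db2241",
    "server_id": "180356619",
    "version": "10.11.13-MariaDB-log",
}
db2162 = {
    "hostname": "db2162",
    "server_id": "180359385",
    "version": "10.11.16-MariaDB-log",
}

# Assumption: per-host identity variables are expected to differ.
unexpected = diff_variables(db2241, db2162, expected=("hostname", "server_id"))
for name, (a, b) in unexpected.items():
    print(f"{name}: {a} != {b}")
```

Filtering out the per-host variables leaves only the entries worth a closer look, such as the version difference noted above.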
Wed, May 20
Upgrading db1258.eqiad.wmnet
This ended up in a broken state as the management password that was entered was incorrect.
See https://phabricator.wikimedia.org/P92681 for related output
Tue, May 19
This host is ready for DC-Ops to decommission
This host is ready for DC-Ops to decommission
This host is ready for DC-Ops to decommission
This host is ready for DC-Ops to decommission
Mon, May 18
Merged on puppetserver1001

