Hey folks, is this on track to happen? Anything else you need from our side?
- Make pc1009 the pc3 primary again: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/700026
- Run stop slave; reset slave all on pc1009, so it stops replicating from pc1010
- Move pc1010 back to pc1: TODO
- Truncate pc1010
- Set pc1010 replicating from pc1007 (see the sketch below for the replication commands)
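Roughly, the replication side of the plan above would look like this (a sketch only: host names are taken from the list, but the replication user, password, and GTID mode are assumptions):
# On pc1009: stop replicating from pc1010 and discard the replica config (sketch)
sudo mysql -e "STOP SLAVE; RESET SLAVE ALL;"
# On pc1010, once truncated: start replicating from pc1007
# (replication user/password are placeholders; GTID mode is an assumption)
sudo mysql -e "CHANGE MASTER TO MASTER_HOST='pc1007.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD='<redacted>', MASTER_USE_GTID=slave_pos; START SLAVE;"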
Mon, Jun 14
Sat, Jun 12
Ohh, right, of course. Never mind then :)
Fri, Jun 11
Optimize of pc1009 (and replica) finished.
Thu, Jun 10
Jun 8th: ~28h
Jun 9th: ~15.5h
Jun 10th: still running
Wed, Jun 9
Yes please to both of those :)
Procedure looks good :)
Tue, Jun 8
Optimize of pc1008 (and replica) finished.
db1157 had a clean mysqlcheck run, repooling it now.
Mon, Jun 7
db1157 upgraded to buster. Running mysqlcheck now.
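For reference, the check is roughly the following (a sketch; the exact options used aren't recorded here):
# sketch: verify all tables after the buster upgrade
sudo mysqlcheck --all-databases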
Re-labelling not necessary, as it wasn't re-labelled away from db1125 in the first place: T283300
Run finished at 2021-06-05T14:30. Running optimize over all pc* tables now.
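The per-table step is plain OPTIMIZE TABLE; a sketch of iterating over the parsercache tables (the pc% table-name pattern is an assumption):
# sketch: rebuild each parsercache table in turn
for t in $(sudo mysql -BN -e "SHOW TABLES IN parsercache LIKE 'pc%'"); do
  sudo mysql -e "OPTIMIZE TABLE parsercache.$t"
done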
Fri, Jun 4
15:39:55 <Krinkle> kormat: it's running now, tee'ed to /home/krinkle/purge_parsercache_now_pc1008.log
Thu, Jun 3
It's back in tendril+zarcillo, and is a replica of db1124.
pc1010 is now pc2 primary, and is no longer replicating from pc1008:
Wed, Jun 2
- db1125 has been renamed, wiped, and reimaged
- It still needs to be re-added to tendril/zarcillo, and have an s4 snapshot deployed on it.
Tue, Jun 1
Thu, May 27
redact_sanitarium.sh completed, and a quick check showed it had been successful.
- Data copy from db2082 completed.
- mysql_upgrade has been run.
- redact_sanitarium.sh currently running.
- pc1 is repooled and back in service.
- pc1010 is now in pc2, and replicating from pc1008. This means it will have at least _some_ relevant entries when it becomes pc2 primary next week.
sudo transfer.py --type file --no-compress --no-encrypt --no-checksum db2082.codfw.wmnet:/srv/sqldata db2094.codfw.wmnet:/srv/sqldata.s8
db2082 is db2094:s8's master:
firstname.lastname@example.org[(none)]> stop slave;
Query OK, 0 rows affected (0.036 sec)
Optimize of pc1007 (and replicas) finished.
Wed, May 26
The purge has finished as of 2021-05-26T06:00Z. I'll start the optimize process now.
Tue, May 25
I wouldn't say it's very _urgent_, but it would definitely be nice to have it done before the dc switchover preparations, for sanity-checking that circular replication is set up correctly.
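The sanity check itself would be something along these lines on each primary (a sketch):
# sketch: on each DC's primary, confirm Master_Host points at its cross-DC peer
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Master_Host|Seconds_Behind_Master'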
I'd vote for a non-'bot'-specific channel name (so -bulk or -firehose or similar). Other than that, LGTM.
- pc1010 is now the primary for pc1
- I've run stop slave on pc1010, so it no longer replicates from pc1007
- I've created a downtime for 7 days for pc[2007,2010].codfw.wmnet,pc1007.eqiad.wmnet
Fri, May 21
For posterity, here's the script I used for the heartbeat changes:
The grant for cumin2002 should now be fully deployed.
Optimize of pc1010 finished.
Yes indeed 🎉
Thu, May 20
This is now fixed. Puppet will no longer start/stop heartbeat. That is managed by db-switchover when changing masters. This does mean that pt-heartbeat-wikimedia needs to be started manually after a boot, however.
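The manual start after a reboot would be something like the following (the systemd unit name is an assumption):
# sketch: start heartbeat by hand on the primary after a reboot
sudo systemctl start pt-heartbeat-wikimedia.service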
kormat@cumin1001:~(0:0)$ sudo debdeploy deploy -u 2021-05-20-wmfmariadbpy.yaml -Q C:wmfmariadbpy
Rolling out wmfmariadbpy: Non-daemon update, no service restart needed
Heartbeat restarted on all primaries.
pt-heartbeat-wikimedia fails to start on db2093 with:
DBD::mysql::st execute failed: Cannot execute statement: impossible to write to binary log since BINLOG_FORMAT = STATEMENT and at least one table uses a storage engine limited to row-based logging. InnoDB is limited to row-logging when transaction isolation level is READ COMMITTED or READ UNCOMMITTED.
This is due to the unusual config of the dbinventory section.
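A quick way to confirm the clashing settings on the host (a sketch; the variable is named tx_isolation on MariaDB):
# sketch: show the binlog format and isolation level that conflict
sudo mysql -e "SELECT @@global.binlog_format, @@global.tx_isolation;"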