Thu, Sep 20
I have a first quick and dirty script to check for backup freshness:
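Roughly along these lines (a minimal sketch only, not the actual script; the backup path and the freshness threshold are assumptions):
```
#!/usr/bin/env python3
# Quick and dirty backup freshness check: a hedged sketch, assuming
# backups live under BACKUP_DIR and anything older than MAX_AGE_DAYS
# should alert. Path and threshold are illustrative, not the real ones.
import os
import sys
import time

BACKUP_DIR = '/srv/backups'   # assumed location
MAX_AGE_DAYS = 8              # assumed freshness threshold

def main():
    now = time.time()
    stale = []
    for name in sorted(os.listdir(BACKUP_DIR)):
        path = os.path.join(BACKUP_DIR, name)
        age_days = (now - os.path.getmtime(path)) / 86400
        if age_days > MAX_AGE_DAYS:
            stale.append('{}: {:.1f} days old'.format(name, age_days))
    for line in stale:
        print('STALE ' + line)
    return 1 if stale else 0

if __name__ == '__main__':
    sys.exit(main())
```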
Not that it changes a lot, but indeed this doesn't happen only on COMMIT, but also on ROLLBACK: Wikimedia\Rdbms\Database::close 10.192.32.7 2014 Commands out of sync; you can't run this command now (10.192.32.7) ROLLBACK
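For reference, this is the general shape of how error 2014 arises: a new command (here, ROLLBACK) is sent while a previous result set has not been fully read. A minimal reproduction sketch using mysqlclient with an unbuffered cursor; connection details are placeholders and this is not MediaWiki's actual code:
```
# Hypothetical reproduction of MySQL error 2014, not MediaWiki's code.
# With an unbuffered (server-side) cursor, rows stay pending on the
# connection until fetched; sending another command first is an error.
import MySQLdb
from MySQLdb.cursors import SSCursor

conn = MySQLdb.connect(host='127.0.0.1', user='test', passwd='test',
                       db='testdb', cursorclass=SSCursor)
cur = conn.cursor()
cur.execute('SELECT rev_id FROM revision')  # result set not drained
conn.rollback()  # raises (2014, "Commands out of sync; you can't run
                 # this command now")
```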
Icinga is now showing me an icon, but etherpad isn't. Not sure how much of this is valid or whether it is just my browser cache misbehaving.
Looking good so far: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=18&fullscreen&orgId=1&var-server=dbstore2002&var-datasource=codfw%20prometheus%2Fops&from=now-3h&to=now-1m
@Bawolff That new query you propose makes no sense to me- it just selects the first 100 revisions every single time (e.g. revision ids 1 to 100, if they all existed). But sure, if they want to select that (ids and a bunch of NULLs), I don't see any problem with doing that (although I would batch by rev id without using a LIMIT, for more deterministic speed- see the sketch below). I am just not sure that is what they *really* want.
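This is the kind of batching I mean- bounded rev_id ranges instead of LIMIT, so the cost of each query is predictable. A sketch only: column names follow MediaWiki's revision table, and the batch size and cursor setup are illustrative:
```
# Sketch: iterate the revision table in fixed rev_id ranges rather
# than with LIMIT/OFFSET, so each batch scans a bounded id range.
# Connection/cursor setup is omitted; batch size is illustrative.

def rev_id_batches(cursor, max_rev_id, batch_size=100):
    start = 1
    while start <= max_rev_id:
        end = start + batch_size - 1
        cursor.execute(
            'SELECT rev_id, rev_page FROM revision '
            'WHERE rev_id BETWEEN %s AND %s',
            (start, end),
        )
        yield cursor.fetchall()
        start = end + 1
```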
I personally don't remember this query being on the top worrying list anymore (unlike a bunch of others); I would personally close it as resolved.
That looks like really bad performance. Not only does it scan the revision table from top to bottom (>200GB of data), making it slow, it is also nondeterministically slow- it will be faster or slower depending on the parameters and the existing data.
Wed, Sep 19
If deleteBatch.php can be used to delete large pages in revision batches once the patch is merged, then yes, it will fully fulfill my needs. I couldn't use it before because it batches pages, but not revisions within a page.
^I am assuming this doesn't have actionables for us (yet), it is just on the radar.
So public dumps are ready:
The initial scope is not fulfilled:
to be scheduled independently.
There is also the fact that I executed an ALTER TABLE
It would fit my theory because it would be (maybe) the first event with GTID that the host received in a long time.
My suggested root cause is that mysql works well as long as GTID is not enabled on its master (it was enabled around the days of the crash). Replicating from a non-GTID-enabled master had worked well for a long time. I have no proof of this; it should be checked.
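One quick way to check it would be looking at the replica's SHOW SLAVE STATUS output (MariaDB reports GTID usage there). A sketch; host and credentials are placeholders:
```
# Sketch of the check: is the replica actually replicating with GTID?
# MariaDB's SHOW SLAVE STATUS exposes this. Connection details are
# placeholders.
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host='dbstore2002.codfw.wmnet',  # assumed host
                       user='check', passwd='check',
                       cursorclass=MySQLdb.cursors.DictCursor)
cur = conn.cursor()
cur.execute('SHOW SLAVE STATUS')
status = cur.fetchone()
print('Using_Gtid:', status['Using_Gtid'])   # 'No' vs 'Slave_Pos'
print('Gtid_IO_Pos:', status['Gtid_IO_Pos'])
```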
Tue, Sep 18
The tables cannot be dropped from other wikis just yet, as the extension is still enabled (but mostly disabled) on other wikis.
dbstore2002:s2 is up and running (although still catching up on replication), enough for the backups to run in codfw unaltered.
List of tables to archive:
Blocked on new backup hardware setup.
Cloning is ongoing from db2041 to dbstore2002 /srv/sqldata; I will rename it later to /srv/sqldata.s2 and restart replication there quickly. After the backups run, some compression process will have to be rerun, which may take quite a long time on the dbstore host with so much load.
So only labsdb1004, labsdb1005, dbstore1002 and the parsercache are pending.
I will do that; I'll see from where.
Mon, Sep 17
Should just be a case of making sure they're named well and compressed with gzip (or similar)- something like the sketch below.
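Illustrative only; the filename, date-stamped naming scheme and paths are assumptions, not an agreed convention:
```
# Sketch: give each dump a descriptive, date-stamped name and gzip it.
# Source path and naming convention are assumptions.
import gzip
import shutil
from datetime import datetime, timezone

src = 'enwiki-revision.sql'  # placeholder dump file
stamp = datetime.now(timezone.utc).strftime('%Y%m%d')
dst = '{}.{}.gz'.format(src, stamp)

with open(src, 'rb') as fin, gzip.open(dst, 'wb') as fout:
    shutil.copyfileobj(fin, fout)
```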
I may have time to work on this (the dumping) this week; I am not sure if they will be available shortly after, as I am not sure what that entails for custom dumps.
The plan is for us dbas to test setting up a single API with the same structure as eqiad, doing everything assuming that fixes it; later we will have to evaluate what the right long-term status is, given some unknowns and related tasks such as T202167:
One note: this index might already exist, but under the name tmp_1, tmp_2 or tmp_3. I don't know how complex renaming an index is, but that'd be useful.
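For what it's worth, a sketch of what the check and the rename could look like (table, schema and index names are illustrative; MySQL 5.7+ and MariaDB 10.5+ support RENAME INDEX directly, while older versions need a drop/add pair, which rebuilds the index):
```
# Sketch only: finding a temporary index name and renaming it.
# Table, schema and index names are illustrative.
FIND_TMP_INDEX = """
SELECT DISTINCT index_name
  FROM information_schema.statistics
 WHERE table_schema = %s
   AND table_name = %s
   AND index_name IN ('tmp_1', 'tmp_2', 'tmp_3')
"""

# MySQL 5.7+ / MariaDB 10.5+: a metadata-only rename.
RENAME = "ALTER TABLE mytable RENAME INDEX tmp_1 TO my_index"

# Older MariaDB: no RENAME INDEX, so drop and re-add (rebuilds it).
FALLBACK = ("ALTER TABLE mytable DROP INDEX tmp_1, "
            "ADD INDEX my_index (my_column)")
```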
The mgmt interface works for me again; waiting for you to finish so I can continue with my maintenance.
db1061 should now be fully down and ready for you! @Cmjohnson
Shutting down service...
Is 250K rows inserted per second fast enough? https://grafana.wikimedia.org is not needed for this one- see https://dbahire.com/testing-again-load-data-on-mysql-5-6-5-7-8-0-non-ga-and-mariadb-10-0-10-1-and-10-2-non-ga/ (I got those numbers on my 5-year-old PC), or will you need more tuning? :-D You can download the sample code here: http://dbahire.com/testing-the-fastest-way-to-import-a-table-into-mysql-and-some-interesting-5-7-performance-results/ and of course we help people precisely with this kind of work. If batch import speed is important, rather than read throughput, we can go with a classic master-replica async replication topology with automatic failover (as we have now with misc services) rather than a galera-like cluster.
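The benchmarks in those posts center on bulk loading; the general shape is something like this sketch (table name, input file and connection parameters are placeholders, and local_infile must be enabled on both client and server):
```
# Sketch of the bulk-load pattern those benchmarks measure:
# LOAD DATA [LOCAL] INFILE is far faster than row-by-row INSERTs.
# Table name, file and connection parameters are placeholders;
# requires local_infile enabled on both client and server.
import MySQLdb

conn = MySQLdb.connect(host='127.0.0.1', user='test', passwd='test',
                       db='testdb', local_infile=1)
cur = conn.cursor()
cur.execute(
    "LOAD DATA LOCAL INFILE 'rows.tsv' INTO TABLE samples "
    "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'"
)
conn.commit()
```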
Fri, Sep 14
db1090 took a long time to recover from replication lag after db1062's initial maintenance, even longer than dbstore1002- need to check why next week.
Thanks, I will update the summary to remove my guess and point to this better evaluation. CCing @tstarling because it could be an intended consequence of the time limit, and maybe those things can be captured/handled better (not sure)?
I will shut down this server fully at 14:11 UTC; this is a self-reminder.
The second puppet run gets stuck on master reimage because it expects mariadb to be up and running.
Thu, Sep 13
db2068 has been recloned, but needs time to catch up on replication and then to be slowly repooled with the above patch.