There are two clusters in the normal ES, with writes (if I understand correctly) going randomly into either. I assume that's for handling load (of which there's not much in Flow), so we are fine with just one?
We will have to do most of this anyway for T226704.
Hi, I am a bit disconnected from the planning of the deployment of this. Once all hosts (or all the hosts planned above) are migrated, is the puppet line supposed to go on the profile (or role), or on base.pp with some exclusions? It is not clear from the ticket description and comments, or I may have missed it, as it is a long ticket :-D.
Mon, Aug 19
Maybe GTID will become usable in 10.4? https://jira.mariadb.org/browse/MDEV-12012?focusedCommentId=132462#comment-132462
Sadly, switchover.py wouldn't be reusable or helpful for an emergency (though the replication and other libraries may be); it would have to start from zero. switchover.py assumes all hosts are reachable and have very low lag, replication is working, etc., which won't be the case in a failover. A failover is a much harder case, where every possibility of breakage has to be contemplated separately and some safe compromises have to be made (e.g. what to do if we detect that X amount of data has been lost).
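For illustration only, a minimal sketch of the kind of per-host check a failover procedure would have to run before deciding anything (assuming MariaDB GTID is in use; this is not switchover.py code):

SELECT @@hostname, @@GLOBAL.read_only, @@GLOBAL.gtid_binlog_pos, @@GLOBAL.gtid_slave_pos;
-- comparing gtid_binlog_pos across the surviving hosts gives a rough estimate of how much each is missing
SHOW SLAVE STATUS\G
-- Slave_IO_Running / Slave_SQL_Running / Seconds_Behind_Master may be broken or unknown during a failover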
Fri, Aug 16
Not sure what the status of this is, considering T228258 exists. db2063's mysql is down, but I am not touching it, just to prevent breaking something.
Jul 19 2019
Going on vacation; I will not work on this at the moment.
Considered resolved with just T219631#5172069, and documented at https://wikitech.wikimedia.org/wiki/transfer.py. A more complete automation will be done later (on a separate task), where provisioning is automated, including setting up replication, starting the server, loading grants, etc.
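As a rough illustration of the "setting up replication" step mentioned above (hostname and user are placeholders, not the actual automation):

CHANGE MASTER TO
  MASTER_HOST = 'db9999.example.wmnet',
  MASTER_USER = 'repl',
  MASTER_USE_GTID = slave_pos;
-- password and coordinates omitted; with GTID, the position comes from gtid_slave_pos set after loading the data
START SLAVE;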
For posterity, this issue generates the following error, which gets mailed:
Jul 18 2019
I can see an abnormal number of transactions on the enwiki master in Sleep state with 195+ seconds of connection time. This is not normal. The errors may be the watchdog killing connections to prevent a larger issue.
I was also told about high "lag" between edits and them showing up in recentchanges on at least cswiki, but the databases had virtually no lag at the moment. This could be related: transactions taking too long to commit.
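For reference, something along these lines (thresholds are arbitrary) can list the long idle connections that still hold an open InnoDB transaction:

SELECT p.id, p.user, p.host, p.time, t.trx_started
FROM information_schema.processlist p
JOIN information_schema.innodb_trx t ON t.trx_mysql_thread_id = p.id
WHERE p.command = 'Sleep' AND p.time > 190
ORDER BY p.time DESC;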
I personally don't have any opinion about Wikimedia-Rdbms: I don't interact with it, don't need it, and am not subscribed to it. It is nice that we are pinged here because of the obvious connection with DBA (thanks!), but I will let any users or potential users decide its usefulness (redefine it, rename it, reorganize it, delete it, merge it, etc.). I believe it was useful in the past, when the RDBMS had no official owner, but I think that is no longer the case.
Jul 17 2019
SAS HD disks of 1.819 TB.
There are no spare USED disks.
@Marostegui Double checking, should we replace this or is it being decommed now?
Jul 16 2019
Could you also confirm that all puppet grants on the puppet database are no longer needed (the mysql database is understood, of course)? You can find them in the misc production grants.
Also the passwords have to be removed from the private repo (and possibly from labs/private).
What about the puppet database on m1?
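A quick way to double-check what is left before dropping anything (assuming the database is literally named 'puppet'):

SELECT user, host FROM mysql.db WHERE db = 'puppet';
-- database-level grants only; global grants in mysql.user would need a separate check
SELECT COUNT(*) AS tables_left FROM information_schema.tables WHERE table_schema = 'puppet';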
FYI, after applying the above change I expected a big shift in reported load (even if performance didn't change) or in temperatures, given that these (wikireplicas on labs) are our busiest databases in CPU resources due to long-running queries. However, unlike other reporters, I didn't see much difference, except in the temperatures of labsdb1011; none in the load or in the temperatures of the others. Maybe CPU was already a problem for scaling database load, or maybe it is something else? https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1563087571551&to=1563260371552&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=labsdb1009&var-instance=labsdb1010&var-instance=labsdb1011
Jul 15 2019
Once the backlog is processed (https://grafana.wikimedia.org/d/000000102/production-logging?refresh=5m&panelId=8&fullscreen&orgId=1), this can be lowered to High, but something should be put in place to prevent another logs outage, even if it is rough, such as an alert to identify it and a runbook to drop a source of logs like the one above.
I never talked about this issue, and have no idea why @Urbanecm thought I was talking about it while I was having a private conversation with another person.
So I have left a dump of the math table on public wikis at: https://people.wikimedia.org/~jynus/math.tar
There is an example of the format of the tar at: https://people.wikimedia.org/~jynus/aawiki-math.sql
I have also backed up the private and closed ones for the short term. What would the next steps be: comparing the hashes with the existing images?
I will remove DBA but remain subscribed, as everything points to this being a decision about MediaWiki optimization for 3rd party usage, not WMF.
Because T135969 has spilled over here: I can see the annoyance with someone using a "bad" encoding, and I will try to be a bit conciliatory; however, I don't see any way to solve that other than documenting it (maybe that was done already):
Jul 12 2019
Thanks, that is great info, I will come back with a list of images and/or a dump at the beginning of next week.
See, we had a bit of a communication breakdown here, but we may be able to move forward :-D Each person only has a limited amount of information. I will take a look and do some tests, and (if you can) I may ask additional questions. Asking for help is always ok, and we are here to help. :-) Thanks for yours! We are the people most interested in cleaning up the existing tables.
Thanks- if you allow me, I will copy that and reopen so this is still tracked. And maybe I can check or delete on a low-traffic db and see what the results are. If you have time, you can help us figure it out with your knowledge and my access :-)
Sorry, but I may not be understanding your comments. Is the extension on WMF ready to have them deleted? If yes, please say so ("please delete ASAP") in the summary. Deleting things carefully should take less than a week. If you don't have time to do the preparation needed, or there is a blocker, I am ok with closing it, although I would also mention in the summary what the pending work is, because maybe someone else can do it in the future. :-D
Thanks, that was probably it. Thanks for clarifying it!
I don't know what I could do to make the task more ready. I will just close the task. Maybe it's best to keep the table forever.
Why decline it? We are waiting on the blocking work before owning this, but nobody seems to be making progress on it.
email@example.com[zarcillo]> update masters set instance='db2069' where section='x1' and dc='codfw';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
And here is probably the issue:
Based on your feedback, my guess is that this effect happens because you are reading from the "master" (there is no other host, really). This probably doesn't happen on WMF production, where reads go to replicas that are already read only, and lag is (I believe) cached. I am guessing the lack of caching infrastructure plus the single-master topology is causing this. That doesn't mean it is not an issue; there is probably a way to optimize this, but I will let others comment on if/how, as I am more familiar with the WMF use case than with MediaWiki in general.
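For context, a hedged sketch of what the checks boil down to on a single-host install versus a WMF-style replica (not the MediaWiki code path itself):

SELECT @@GLOBAL.read_only;
-- OFF on a single "master", ON on the replicas reads normally go to
SHOW SLAVE STATUS\G
-- empty set on a master; on a replica, Seconds_Behind_Master is a common source of the "lag" value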
I saw a few occurrences (5) of this on trwiki: https://logstash.wikimedia.org/goto/d9c3d20188039cd82d1d9367270f842c
Everything went well except:
Updating tendril...
[WARNING] Old master not found on tendril server list
Updating zarcillo...
[WARNING] Old master not found on zarcillo master list
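For reference, the zarcillo masters table (the same one updated manually above) can be inspected directly to see which instance it still lists for the section:

SELECT instance, section, dc FROM masters WHERE section = 'x1' AND dc = 'codfw';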
Based on https://noc.wikimedia.org/conf/highlight.php?file=db-codfw.php&1 and T184888, I will switch over the codfw master to db2069.
I will try to force a relearn process/reboot, in case that works.
I'd say it's safe to store a backup
@Papaul We will ask you to replace a disk here from T226406, when they arrive.
@Anomie I strongly support closing old tasks like this that can no longer be reproduced; too much in the code and infrastructure changes for them to stay relevant in an open status. I believe MediaWiki improvements, MySQL state improvements (e.g. ANALYZE + reboot, SSDs) and MariaDB code improvements may have made many of these obsolete.
Chris, you will need to coordinate mainly with @elukey, as he is the person directly in touch with the affected users to agree on a date.
If I recall correctly, the maintenance was along the lines of running MediaWiki's recompression scripts and other similar scripts, which would first require making sure the scripts still work right.
@Anomie it is probably the old MediaWiki vs Wikimedia distinction. I don't have visibility into the impact outside of WMF, but I would suggest setting up a lower-priority, generic task to review optimizer hints and document them or remove the unnecessary ones, starting with this one.
It is ok to close it if it is a duplicate, or you think it is unlikely to happen again, or it is a very rare occurrence. I just report when I see something out of the ordinary in the logs, FYI, but I lack the knowledge for a deep analysis.
May I ask where you tested this and, if it was on your own installation, for more data about it (version, topology, configuration, etc.)? And, in any case, how did you profile the queries executed (just enabling debug logging for all queries)? Also, please point us to the code entry point for that function (is it using a master or a replica to perform the reads?).
See also recent T217755
Jul 11 2019
I think your classification was already right; I was proposing to add a new yellow one on top, as work on those normally requires a combination of performance, DBAs, and either core or another team on the product side, depending on the code module. The reasoning is that it would help avoid duplicate reports, in the same spirit as Wikimedia-production-error. It was just a suggestion; I understand if you consider it may not be useful.
@Krinkle It took me some time to understand your comment and classification. Wouldn't it be nice to have a specific tag such as #query-performance, #wikimedia-query-performance or #slow-database-queries, for discoverability reasons (e.g. finding duplicates and previous examples)? What do you think?
Needs some research to see if it is safe. Low priority because it shouldn't block any other task.
Manuel is on vacation ATM; I am glad to answer any questions, although DBAs need more concrete questions (e.g. we can answer how much space and how many iops would be saved for a particular wiki or table), as costs such as development time would be better estimated by the people involved in the Wikimedia-Rdbms code bits.
root@prometheus2003:/srv/prometheus/ops/targets$ ls -la mysql-*
-r--r--r-- 1 root root 2592 Jul 11 11:27 mysql-core_codfw.yaml
-r--r--r-- 1 root root  612 Jul 11 11:27 mysql-dbstore_codfw.yaml
-r--r--r-- 1 root root  544 Jul 10 10:57 mysql-labs_codfw.yaml
-rw-r--r-- 1 root root  544 Jul 10 10:48 mysql-labsdb_codfw.yaml
-r--r--r-- 1 root root  621 Jul 11 11:27 mysql-misc_codfw.yaml
-r--r--r-- 1 root root  275 Jul 11 11:27 mysql-parsercache_codfw.yaml
root@prometheus2003:/srv/prometheus/ops/targets$ date
Thu Jul 11 11:29:19 UTC 2019
root@prometheus2003:/srv/prometheus/ops/targets$ run-puppet-agent
Warning: Downgrading to PSON for future requests
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for prometheus2003.codfw.wmnet
Info: Applying configuration version '1562844569'
Notice: /Stage[main]/Profile::Prometheus::Ops_mysql/Exec[generate-mysqld-exporter-config]/returns: executed successfully
Notice: Applied catalog in 18.99 seconds
root@prometheus2003:/srv/prometheus/ops/targets$ ls -la mysql-*
-r--r--r-- 1 root root 2592 Jul 11 11:27 mysql-core_codfw.yaml
-r--r--r-- 1 root root  612 Jul 11 11:27 mysql-dbstore_codfw.yaml
-r--r--r-- 1 root root  544 Jul 10 10:57 mysql-labs_codfw.yaml
-rw-r--r-- 1 root root  544 Jul 10 10:48 mysql-labsdb_codfw.yaml
-r--r--r-- 1 root root  621 Jul 11 11:27 mysql-misc_codfw.yaml
-r--r--r-- 1 root root  275 Jul 11 11:27 mysql-parsercache_codfw.yaml
Jul 10 2019
We already have the sizes of all uncompressed and compressed tables in zarcillo; those are planned to be shown in a dashboard. The reason they are not more public is that Security told us not to put them on public Prometheus, as they could compromise the anonymity of certain users on smaller wikis. Please talk to Security before doing it, and please talk to us DBAs before reimplementing an existing feature.
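For reference, on any individual host the per-table sizes can be gathered from information_schema (the schema name here is just an example; the zarcillo tables that aggregate this are not shown):

SELECT table_schema, table_name, data_length, index_length
FROM information_schema.tables
WHERE table_schema = 'enwiki'
ORDER BY data_length + index_length DESC
LIMIT 10;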
Jul 9 2019
$ ./replication_tree.py db1065
db1065, version: 10.1.33, up: 1y, RO: OFF, binlog: MIXED, lag: None, processes: None, latency: 0.0991
+ db1117:3322, version: 10.1.39, up: 32d, RO: ON, binlog: MIXED, lag: 0, processes: 15, latency: 0.0423
+ db1132, version: 10.1.39, up: 14h, RO: ON, binlog: MIXED, lag: 0, processes: 16, latency: 0.0416
+ db2044, version: 10.1.39, up: 4d, RO: ON, binlog: MIXED, lag: 0, processes: None, latency: 0.0046
+ db2078:3322, version: 10.1.39, up: 47d, RO: ON, binlog: MIXED, lag: 0, processes: 14, latency: 0.0056
Jul 8 2019
HW seems to be fixed; taking ownership for the follow-up (software) steps.
Jul 4 2019
Jul 3 2019
Just to rule out any kind of database anomaly, I ran: