Further exploration of the existing metadata has been done at: https://gerrit.wikimedia.org/r/637769
Yes, no problem with that, I was just giving context in case you didn't remember about that, after they got moved around.
Small hint- note there are 2 testing hosts in eqiad: db1077 and the "backup testing" host, db1133.
Thu, Oct 29
Not related, but the first incident was spotted during the datacenter switchover.
This is no longer UBN, please feel free to consider it resolved/declined based on my comments on the subtask.
I am going to consider this resolved, unless we were completely wrong and this wasn't the cause of the database stalls/connectivity issues.
I am fairly certain that this was the issue causing the db1075 stalls, as the processlist has decreased a lot (a slow query can be millions of times more impactful than a regular query).
Looking great so far-
If I can provide more background: under normal circumstances, pc* hosts are active-active, and no change should happen on them (no read only changes, etc.). This was solved on zarcillo by setting masters per datacenter, so no change has to happen. Because zarcillo never replaced tendril, the issue is not so much with the switchover scripts as with the tendril model, which can only set up one master per global replica set, and not one per datacenter. For convenience, on tendril the "masters" are considered to be the ones on the active dc, but that does not accurately reflect reality.
Because this is actively causing outages for 800+ wikis on s3.
Adding MediaWiki-General because we need to identify which team may know about DynamicPageListHooks::renderDynamicPageList
Wed, Oct 28
I have added it to the logical backup process by adding the right grants on the new database to the existing dump user/process, but let's revisit once the people working on the setup are happy with the deployment.
For archival purposes, this is the (naive) code solution for downloading all images of a wiki using the mwclient library (https://gerrit.wikimedia.org/r/636007):
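(The actual change is the gerrit link above; what follows is only a minimal sketch of the same approach, assuming mwclient's allimages() listing and Image.download() method, with an illustrative wiki and output directory.)

  import os
  import mwclient

  # Illustrative target wiki and output directory; the real script is in
  # the gerrit change linked above.
  site = mwclient.Site('test.wikipedia.org')
  os.makedirs('downloads', exist_ok=True)

  for image in site.allimages():
      # image.name is e.g. "File:Example.jpg"; strip the namespace and
      # avoid path separators (naive: no dedup, retries or throttling).
      filename = image.name.split(':', 1)[-1].replace('/', '_')
      with open(os.path.join('downloads', filename), 'wb') as fd:
          image.download(destination=fd)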
Tue, Oct 27
<icinga-wm> RECOVERY - Check systemd state on dbprov1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
Independently of the source of the issue, could these regenerations be throttled/rate limited (assuming it is a background process)? It is clear that, while they don't affect the same-dc databases much, the number of writes does not scale over multiple datacenters.
So my guess is this is only happening on buster.
Mon, Oct 26
I didn't find a ticket, so maybe it was only an informal conversation with no actionables. This was something we wanted to do because, when implementing "primaryhost.slaves()" in the WMFMariaDB code, we didn't have a report of the host.
I believe there was a ticket where we referred to this; let me try to search for it, as I think I ran into this issue for the WMFReplication class.
I am getting strange, inconsistent results every time I check; now I've seen the increase happening starting at ~9:04h (with no deployments around that time):
Adding Wikimedia-production-error as it seems to coincide with a non-train deploy at 16:45 on the 22nd. I am unable to find it on SAL, however?
Sun, Oct 25
I'm not sure (really) what exactly you want.
Fri, Oct 23
This is not a serious blocker, but maybe it could be an "easy" task for a newcomer, assuming people are ok with it?
First run has been done on all hosts, all clean now as far as mysqlcheck / CHECK TABLES is concerned (only commonswiki on db2099 had a bad index, now fixed and rechecked).
Thu, Oct 22
After running mysqlcheck in a very supervised way on almost all hosts, I can say this is not as easy as "just setting up a cron and running it every week". The CHECK TABLES command on all tables can take up to 24 hours per host, and it has a big impact. We don't have the proper monitoring/tuning configuration to handle this, plus it makes backups fail frequently if both run concurrently (at least 3 snapshots failed because of ongoing checks).
I am dropping and then recreating the index in separate transactions, with the hope that that will be a bit faster than recreating the full table. I will run a CHECK TABLES at the end to verify that fixes it.
Wed, Oct 21
We finally have a positive:
I am going to consider it resolved, as this was created in the past for a very specific event, but it is no longer very concrete. That doesn't mean documentation shouldn't be improved; it is that I see no value in tracking anything concrete on a task. Reopen if you disagree. Documentation should be improved a lot, but tasks should have a reason to be open, and there is no longer much activity or clear actionables here.
the host will need a reimage
We should drop the profiling table from source backup hosts before setting up the regular checking to prevent extra log spam.
I will take care of dropping it first on the backup sources so those don't contaminate other hosts; the other hosts will have to wait until the dc switchback from codfw to eqiad.
I did a manual systemctl reset-failed once.
Aside from that, what I can do is add a check just before the rotation to "latest" to see if something is "reading" the dir, and kill it before moving it to latest? Maybe restrict it to the myloader pid/recovery script?
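For illustration only, the kind of pre-rotation check I have in mind, sketched with psutil (the dump directory path and the myloader-only restriction are assumptions, not what the actual rotation code does):

  import psutil

  DUMP_DIR = '/srv/backups/dumps/ongoing'  # hypothetical path

  def readers_of(path):
      """Return processes that have files open under the given directory."""
      readers = []
      for proc in psutil.process_iter(['pid', 'name']):
          try:
              if any(f.path.startswith(path) for f in proc.open_files()):
                  readers.append(proc)
          except (psutil.AccessDenied, psutil.NoSuchProcess):
              continue
      return readers

  # Just before renaming to "latest": only kill the known reader
  # (myloader); log anything else rather than killing it blindly.
  for proc in readers_of(DUMP_DIR):
      if proc.info['name'] == 'myloader':
          proc.terminate()
      else:
          print('unexpected reader:', proc.pid, proc.info['name'])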
Everything there looks fine! There may be procedures that I could help you simplify so they can be done more easily; we can talk on a different medium at a later time to avoid spamming other people here.
I am going to decline this, not because it is a bad suggestion, but because the fix is not really a fix as much as a "way to avoid alerting" (aka reduce toil); I want to make sure this is toil (and happens more than once) before deploying it. If it happens again, CC @Kormat, reopen this and I will just deploy it on all dbprovs.
@Marostegui s2 on codfw gave no errors. This is what I expected: given we had no issues in the past with 10.1, I think it has to be the combination of corruption and the upgrade to 10.4. We can do a test of moving s2 to 10.4 on a test host, and then running the test? Or we can agree to do so after every upgrade (a full CHECK TABLES and not only the one done for the upgrade, even if it takes longer).
I am going to be bold and close this as fixed, based on the original reporter's response; remaining tasks can be handled at parent T109238.
Sorry for the late response, it was very late in our TZ.
Tue, Oct 20
As a last comment, I thought at first it was 1, but after some analysis I believe it is more likely that it was 2, given the rows involved were very frequent; but as you say, it is not easy to prove.
ES are backed up, but currently only locally. We need to finish the cross-dc backup, hopefully in Q3.
I think I will need guidance on what is not clear from someone less "into" backups. I understand what there is now is not great; the problem is I am too deep into the rabbit hole to figure out how to approach this (TODOs? recipes? Examples?)
Wishlist, but not planned at the moment; we first need to work on object inventory. We will want it eventually to check live data corruption/backup corruption.
This is most likely delayed to Q3, or even further if we set up an alternative backup method to bacula.
I am going to consider this resolved: there is monitoring, and we have a dashboard and tooling for it (command line and prometheus exporter) (https://grafana.wikimedia.org/d/413r2vbWk/bacula). Everything is documented at: https://wikitech.wikimedia.org/wiki/Bacula#Monitoring
So the second part is kind of expected: "Running myloader..." indicates that the process has started, and it won't finish as long as the underlying myloader hasn't finished... which, unless I understood incorrectly, it hadn't (it was blocked)?
I can confirm backups have been flowing weekly as expected:
Not sure who owns this to declare it resolved, but as the original reporter, I think it is: per the above link, there was no incident in the last 4 weeks. Thanks to everyone that helped here.
To be fair, technically, this is resolved because wb_terms has been replaced, AFAIK, with a more specialized mechanism (several smaller and normalised tables) :-)
In other words, this is a subtask of the bigger issue T146149, specific to the backup-related hosts.
I think Manuel and/or I requested to document what grants are needed to set up a backup host. The problem is there is no good way to do so, as grants are currently only maintained/documented in a text file for core mediawiki hosts, and there is no way to define/document non-core/non-misc grants.
@Marostegui 2 questions:
Another small correction:
it could bring us the capability to write into the ParserCache from the secondary DC, which we don't currently need, but we could certainly think of some usages for it
Small addendum: Note that parsercache functionality is memcached + MySQL, not just MySQL. In fact the MySQL part was a later addition for disk persistence/larger dataset.
This release deprecates the sshd_config UsePrivilegeSeparation
option, thereby making privilege separation mandatory. Privilege
separation has been on by default for almost 15 years and
sandboxing has been on by default for almost the last five.
Backup of commons files is part of the more ambitious "Backup all wikis media files" project, currently being worked on at T262668 and subtasks.
It's something we seemingly only do in a minority of cases. At least in MW core, only on TINYINT, in two cases; the 17 other tinyints in tables.sql don't do it.
Is s2 completed?
I have one question before everything else: does the parsercache expansion mean a new "cluster/service" in parallel to the existing parsercache, or would it be more like an expansion of the current service, to increase the number of hits / change the pc policy to store more data?
But the backup source hosts have notifications disabled for lag, no?
Mon, Oct 19
100% agree with this; sadly, the lag part is not configurable yet on icinga alerts. :-( We'll see what I can get done; however, for now I will just run it manually on all hosts to discard ongoing issues.
Implementing this should be relatively easy: just run mysqlcheck -c -A on the host. The problem is how to deal with potential replication delays.
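A very rough sketch of what I mean, assuming a local pymysql connection to read the replica status; the lag threshold, credentials file and the "wait before starting" strategy are only illustrative, and this does not address lag caused by the check itself:

  import subprocess
  import time
  import pymysql  # assumption: local connection via the root defaults file

  MAX_LAG = 30  # seconds; illustrative threshold

  def replication_lag():
      """Return Seconds_Behind_Master, None if replication is broken."""
      conn = pymysql.connect(read_default_file='/root/.my.cnf')
      try:
          with conn.cursor(pymysql.cursors.DictCursor) as cur:
              cur.execute('SHOW SLAVE STATUS')
              status = cur.fetchone()
      finally:
          conn.close()
      if not status:
          return 0  # not a replica, nothing to wait for
      return status['Seconds_Behind_Master']

  # Wait until the replica has caught up before starting the (very long)
  # CHECK TABLES run on all databases.
  lag = replication_lag()
  while lag is None or lag > MAX_LAG:
      time.sleep(60)
      lag = replication_lag()

  subprocess.run(['mysqlcheck', '-c', '-A'], check=True)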
Ladsgroup: I already gave an opinion at T119173#6064823 (please note that that was in the context of the original proposal of the task, "ban ENUMs", not the context of the current task/RFC). That and subsequent comments already capture the essence of my suggestions (discourage them and encourage table normalization). Of course, the DBAs acting on schema changes may have additional thoughts.
Interesting trivia: MySQL is going to deprecate the definition of length for numeric types (e.g. the "(11)" in INT(11), which is only a display width): https://dev.mysql.com/worklog/task/?id=13127 We should stop defining them in the first place.