Fri, Jan 21
^@Marostegui I just remembered that dbbackups point to db1159, and not the proxy, due to the current TLS certificate limitation, and the worry about sensitive data being accessed cross-datacenter. It will have to be deployed after switchover. I can do it, but I'm involving you in case I don't happen to be around.
+1 as owner of database dbbackups; not sure if there is anything else there.
Wed, Jan 19
Any dbprov host: /srv/backups/snapshots/latest if it is a snapshot, /srv/backups/dumps/latest if it is a logical dump. Ideally with a recognizable file or dir name, even if the section is made up (e.g. dump.tendril.<date>).
To expand marostegui's answer (as I also researched it at T271148#6735477):
Can we "just" add the following
I need to talk to LSobanski- I mentioned the need to think about this when upgrading the backup infrastructure, but I don't remember if we ended up scheduling it for this quarter or a future one. It is certainly in the TODO, but because there should be no hard dependency against databases this time, I can take care of this sooner or later. It may require some non-trivial upgrades and dependency upgrades/building new packages for backup software that may not be ready yet or need some planning (e.g. a bacula upgrade- while technically not a dependency, it will be a relatively large project), plus it needs weighing against the several backup priorities (if there are more urgent things, we may delay it a bit).
Codfw first pass finished for all wikis, this is the percentage of errors:
Tue, Jan 18
Sorry to add @daniel but you are the biggest expert I know about metadata tables, and the content table in particular. Could you help me come up with an explanation why something like "object type text table references on dewiki" may have regressed? Or php serialization changes? See my above comments.
After reading some docs, the text row seems to indicate "it has the same content as text row oldid = 5815762 (which is revision 5806950: https://de.wikipedia.org/w/index.php?title=Enzym&oldid=5806950 )", and that seems very plausible in context.
I think I found it, the referenced text row says:
Hey, should I try to search backups for that blob? I guess there is a very small chance it is there, but it takes very little to check. Do you know the missing key on ES database to search for?
Wed, Jan 12
I mean the insert traffic will be there already
Exceptions seem to be nice now after full revert. :-)
The current .16 background issues seem to be exceptions for failing to lock pages at:
Here are the symptoms that continue :-(:
The number of row writes per second doesn't seem to have yet gone back to pre-deployment levels:
Tue, Jan 11
I just saw that db2078 has a failed service: prometheus-mysqld-exporter.service. I haven't researched further- I don't know if it fails because of the new package, if it is a one-time failure because of the reimage, or if it is a WIP/known issue, but I'm notifying it here, as you told me to bring up anything weird I saw. This doesn't affect backups or my work in any way.
Mon, Jan 10
Deployed the change as this:
commit baeb288d4d9713814ac88e9537bbcf0ece5bb9e4
Author: generate-dns-snippets <email@example.com>
Date:   Mon Jan 10 16:17:22 2022 +0000
Dec 23 2021
Commonswiki codfw backup copy, with 91823709 files backed up and a 0.04% error rate.
Commonswiki is now backed up on 2 geographically redundant locations within WMF infrastructure.
Thank you for the feedback! As we all think this was not disabled on purpose (I've seen the setup scripts fail on some steps sometimes), I will do: https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_can_I_add_the_IPv6_AAAA/PTR_records_to_a_host_that_doesn't_have_it? as documented, and that should take care of it.
I just realized with jbond that the 2 servers (ms-backup1001.eqiad.wmnet and ms-backup1002.eqiad.wmnet) do not have ipv6 AAAA dns records.
Dec 22 2021
Dec 21 2021
Dec 17 2021
Codfw commonswiki backups are at 75% completion (68854627 files/301887395014767 bytes backed up), and will likely finish by next week.
Dec 16 2021
This is a prototype version of the (trivial/non-massive) recovery script, interactive version:
Dec 15 2021
A first pass on eqiad finished successfully: 101,970,844 files backed up successfully, with a total size of 373,335,321,603,376 bytes and an error rate (by size) of 0.035%.
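As a side note, the "error rate (by size)" figure is simple arithmetic over bytes; a minimal sketch (the total size is the one reported above, but the failed-bytes figure is hypothetical, made up only to illustrate the formula):

```python
# Sketch of the error-rate-by-size computation. Only total_bytes is
# taken from the reported run; failed_bytes is a hypothetical value.
total_bytes = 373_335_321_603_376
failed_bytes = 130_667_362_561  # hypothetical, not from the actual run

error_rate = failed_bytes / total_bytes * 100
print(f"{error_rate:.3f}%")  # 0.035%
```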
I don't think it is worth this being open anymore- there indeed is a need to review it to generate a better description, but hopefully it will be caught up in the incident review process that is kickstarting now.
This is the list of media backup errors (making it NDA-only, as I haven't checked yet everything there is non-private):
Dec 14 2021
Proof it is working:
Dec 13 2021
Check now- I've made another restore into /home/krinkle/restore2/ and I think you should be able to see it. I think I see the issue- the setgid bit forces inheritance from the parent, which will do weird things when restoring as non-root. This is an important thing to notice when doing non-destructive restores, but I think it shouldn't hit us for a regular restore.
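For reference, the setgid bit discussed above is just one bit in the file mode; here is a minimal sketch of checking for it (the helper name is made up for illustration):

```python
import stat

def has_setgid(mode: int) -> bool:
    """Check whether the setgid bit is set in a file mode.

    On a directory, setgid makes newly created files inherit the
    directory's group -- the inheritance behaviour mentioned above.
    """
    return bool(mode & stat.S_ISGID)

# 0o2775: a typical group-writable directory with setgid on;
# 0o0755: the same directory after a restore that drops the bit.
print(has_setgid(0o2775))  # True
print(has_setgid(0o0755))  # False
```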
Actually, you may have found something that I am unable to answer you on: that directory (the original one) seems to have the setgid bit on, which of course bacula won't maintain. Whether that is intended, or how to handle it, I cannot tell you. I would ask you to inquire what the best way to solve this is for the original files, or whether we have to work around it.
Thanks, this was useful to detect something applicable to future automated recoveries. Bacula backs up full paths, and keeps the permissions of those paths, so restore, restore/srv, and restore/srv/mediawiki-staging were kept as root.
Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished         Name
====================================================================
396417  Full   108,320    11.70 G   OK       13-Dec-21 09:34  graphite1004.eqiad.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily
396418  Full   108,320    11.70 G   OK       13-Dec-21 09:35  graphite2003.codfw.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily
Running Jobs:
Console connected using TLS at 13-Dec-21 09:20
 JobId  Type  Level    Files      Bytes   Name                                                                     Status
======================================================================
396417  Back  Full     4,568    412.9 M   graphite1004.eqiad.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily  is running
396418  Back  Full         0          0   graphite2003.codfw.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily  is running
====
Dec 10 2021
Let me give it a deeper look, while the patch by itself looks good as is, I want to check if a different (non-default) backup policy would be more advantageous in frequency and space. :-)
db1102 is back up again, BTW (T296546).
Dec 9 2021
Sorry, when I said closed, I really meant deleted.dblist.
sys schema is https://mariadb.com/kb/en/sys-schema/ and should be on all wmf databases, and the admin user should have access to it on mw databases.
ops is used for the log and functions of the query killer, should be on all mw databases (at least for now).
You should check against the closed.dblist- I have been an advocate that if those were to be kept, they shouldn't be on s3, but on a hypothetical "s0"- a very small section (e.g. on a VM)- to save resources.
Others will be relics of the past: maintenance jobs/wikis created by mistake.
Dec 3 2021
@Majavah: A reminder that I will let you or the rest of the cloud services team mark its status appropriately at https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge#dbbackups-dashboard for your internal coordination.
The host is down and ready to be serviced. Please let us know if the stick could be removed successfully, or if any other issue arises, as it may require puppet memory adjustments before putting it back into production.
@Cmjohnson Today would be preferred, as Monday I will be off, and won't be able to put it down and back up. I will shut down the server, and if it cannot be serviced, we can do it any day on or after the 9th.
from: SYSTEMDTIMER to: firstname.lastname@example.org
Dec 2 2021
But @Kormat will need to bring it back up the following day (or wait till 9th for you).
Sadly, I won't be around on the 7th. There is no issue regarding the move (backups should have finished by that time, ip changes should not affect backups), but either the date has to be moved, or someone will have to stop the servers for me. I can put them up on the 9th when I return (no big issue with those being down for an extended time), but I won't be able to shut them down on the 7th.
Unsubscribing, as I think I was added to this ticket by mistake. This is traffic/traffic security expertise, and they have already triaged and are aware of the task.
Dec 1 2021
Ah, so I guess not in production, that leaves me less worried. Sorry, I searched for ms-fe and monotonic but couldn't find this ticket.
Reassigning to btullis, as he was the person to bring up the alias issue, so he can evaluate/review if the patch is a good solution to his concerns or not (I don't use aliases).
Nov 30 2021
My only concern is that it may want the memory to be mirrored.
I think your intentions were good :-). I edited my comment- it started being supported in MariaDB 10.5, and once we can stop supporting other versions, we should use the new command- for the sake of consistency (primary/replica) :-D.
@Dinoguy1000 I don't think you should do that- "SHOW REPLICA STATUS" is not a valid mariadb command (yet- it started being accepted in 10.5), "SHOW SLAVE STATUS" is the working command that has to be sent- we don't have a choice on that. <strike>I think you should file an upstream bug instead to mariadb to ask to rename the command</strike> You can file a ticket to use SHOW REPLICA STATUS once we stop supporting <10.5, but as a separate ticket- otherwise those kinds of edits would be confusing (they are not names we can choose to change- yet, unlike other identifiers).
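The version gating described above can be sketched as a small helper; this is illustrative only (the function name and shape are made up, not actual wmf tooling):

```python
def replication_status_command(version: tuple) -> str:
    """Pick the replication status statement for a MariaDB version.

    SHOW REPLICA STATUS is only accepted from MariaDB 10.5 on, so as
    long as older versions are supported, SHOW SLAVE STATUS is the
    command that has to be sent.
    """
    if version >= (10, 5):
        return "SHOW REPLICA STATUS"
    return "SHOW SLAVE STATUS"

print(replication_status_command((10, 4)))  # SHOW SLAVE STATUS
print(replication_status_command((10, 6)))  # SHOW REPLICA STATUS
```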
Are you ok leaving this server as is, until the refresh happens in Q3?
Nov 26 2021
I checked grants on all mariadb::backup_source s (db[2097-2101,2139,2141].codfw.wmnet,db[1102,1116,1139-1140,1145,1150,1171].eqiad.wmnet) at the same time I did a rolling upgrade/restart. I think I kept them all in sync with production (as much as I could- s6, s7 and x1 have some custom ones), but I removed the wikiadmin@localhost grants when I found them, as I had copied the grants from the model I was told to follow. FYI
I don't have the answer to that question, but whenever any of you have the servers and path(s), you can follow the instructions at https://wikitech.wikimedia.org/wiki/Bacula#Adding_a_new_client to send a preliminary backup proposal to Puppet, and I will assist you to merge it with the proper setup (e.g. schedule, day, etc.) - I think it will be more useful to discuss the details over a patch :-).
Nov 25 2021
One more question, to finally decide between setting up weekly full backups or daily incremental ones- do all files mostly change completely, or only a subset of them? Incrementals can only be done with file granularity (a file will be backed up in full as soon as its path or hash changes). If value.wsp changes every minute, and there is only 1 per value, we will do "weekly full only"; otherwise the daily incrementals may be preferred.
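The trade-off above can be sketched as a toy decision rule (the function and the 0.5 threshold are made up for illustration- this is not a real bacula policy):

```python
def pick_schedule(changed_fraction: float, threshold: float = 0.5) -> str:
    """Decide between weekly-full and daily-incremental backups.

    Incrementals work at file granularity: a changed file is re-backed-up
    in full. So if (nearly) every file changes between runs, incrementals
    save nothing and a weekly full is simpler. Threshold is arbitrary.
    """
    if changed_fraction >= threshold:
        return "weekly full only"
    return "daily incremental + weekly full"

# e.g. every value.wsp file is written every minute -> everything changes
print(pick_schedule(1.0))   # weekly full only
print(pick_schedule(0.05))  # daily incremental + weekly full
```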
Nov 24 2021
if it's higher than a percentage
Adding this here, in case someone else was confused on how some of those values could shrink (e.g. compared to T63111#5782953) and apparently, some PKs were doubled by being converted to unsigned (e.g. wikidata.rcs).
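The "doubling" from a signed-to-unsigned conversion is just the sign bit becoming a value bit- same 32 bits, roughly twice the positive range. A quick arithmetic check:

```python
# Range of a MySQL/MariaDB signed INT vs unsigned INT (both 32-bit).
signed_int_max = 2**31 - 1    # 2,147,483,647
unsigned_int_max = 2**32 - 1  # 4,294,967,295

# The unsigned maximum is exactly 2 * signed_max + 1.
print(unsigned_int_max == 2 * signed_int_max + 1)  # True
```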
The number of files is (within reason) a non-blocker for bacula, as files are packaged into volumes. It is true that each file is stored as a mysql record, but that should be able to scale to dozens of (US) billions, although it may be slow to recover when rebuilding metadata.
Deployment went as expected- but now that I have thought about it a bit, I think btullis brought up something that is now in a confusing state:
Nov 23 2021
it will give us greater flexibility if and when we want the dbstore* and db* configurations to diverge. Is that about right?
^What do you think Data-Engineering people?