Wed, Jan 12
I mean the insert traffic will be there already
Exceptions seem to be back to normal now after the full revert. :-)
The current .16 background issues seem to be exceptions caused by failures to lock pages at:
Here are the symptoms that continue :-(:
The number of row writes per second doesn't seem to have yet gone back to pre-deployment levels:
Tue, Jan 11
I just saw that db2078 has a failed service: prometheus-mysqld-exporter.service. I haven't researched further; I don't know if it fails because of the new package, is a one-time failure caused by the reimage, or is a WIP/known issue, but I'm noting it here, as you told me to bring up anything weird I saw. This doesn't affect backups or my work in any way.
Mon, Jan 10
Deployed the change as follows:
commit baeb288d4d9713814ac88e9537bbcf0ece5bb9e4
Author: generate-dns-snippets <firstname.lastname@example.org>
Date:   Mon Jan 10 16:17:22 2022 +0000
Thu, Dec 23
The commonswiki codfw backup copy finished, with 91,823,709 files backed up and a 0.04% error rate.
Commonswiki is now backed up on 2 geographically redundant locations within WMF infrastructure.
Thank you for the feedback! As we all think this was not disabled intentionally (I've seen the setup scripts fail on some steps sometimes), I will follow https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_can_I_add_the_IPv6_AAAA/PTR_records_to_a_host_that_doesn't_have_it? as documented, and that should take care of it.
I just realized with jbond that the 2 servers (ms-backup1001.eqiad.wmnet and ms-backup1002.eqiad.wmnet) do not have IPv6 AAAA DNS records.
Wed, Dec 22
Tue, Dec 21
Dec 17 2021
Codfw commonswiki backups are at 75% completion (68,854,627 files / 301,887,395,014,767 bytes backed up), and will likely finish by next week.
Dec 16 2021
This is a prototype of the interactive version of the (trivial/non-massive) recovery script:
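Something roughly along these lines, as a purely illustrative sketch (not the actual prototype; the table, column and connection names below are invented for the example):

#!/usr/bin/env python3
# Purely illustrative sketch; the "mediabackups" database, "files" table
# and "storage_path" column are made up and are NOT the actual prototype.
import sys
import pymysql

def main():
    wiki = input("Wiki (e.g. commonswiki): ").strip()
    title = input("File title to recover: ").strip()
    db = pymysql.connect(read_default_file="~/.my.cnf", database="mediabackups")
    with db.cursor() as cur:
        cur.execute(
            "SELECT storage_path FROM files WHERE wiki = %s AND title = %s",
            (wiki, title),
        )
        row = cur.fetchone()
    if row is None:
        sys.exit(f"No backup found for {title} on {wiki}")
    print(f"Would fetch {row[0]} from backup storage after confirmation.")

if __name__ == "__main__":
    main()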
Dec 15 2021
A first pass on eqiad finished successfully: 101,970,844 files backed up, with a total size of 373,335,321,603,376 bytes and an error rate (by size) of 0.035%.
I don't think it is worth keeping this open anymore. There is indeed a need to review it and generate a better description, but hopefully that will be covered by the incident review process that is kickstarting now.
This is the list of media backup errors (making it NDA-only, as I haven't yet checked that everything there is non-private):
Dec 14 2021
Proof it is working:
Dec 13 2021
Check now: I've made another restore into /home/krinkle/restore2/ and I think you should be able to see it. I think I see the issue: the setgid bit forces group inheritance from the parent directory, which will do weird things when restoring to a non-root destination. This is an important thing to keep in mind when doing non-destructive restores, but I don't think it should hit us for a regular restore.
Actually, you may have found something that I am unable to answer: that directory (the original one) seems to have the setgid bit set, which of course Bacula won't maintain. Whether that is intended, or how to handle it, I cannot tell you. I would ask you to inquire what the best way is to solve this for the original files, or whether we have to work around it.
Thanks, this was useful for detecting something applicable to future automated recoveries. Bacula backs up full paths and keeps the permissions of those paths, so restore, restore/srv, and restore/srv/mediawiki-staging were kept as root.
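As a side note, a rough sketch (only an assumption on my side, not part of any existing tooling) of how one could hand a tree restored as root over to the requesting user after a non-destructive restore into their home:

import os
import shutil

def chown_restored_tree(root, user, group):
    # Recursively hand over a tree that Bacula restored as root
    # to the user who requested the restore.
    for dirpath, _dirs, filenames in os.walk(root):
        shutil.chown(dirpath, user=user, group=group)
        for name in filenames:
            shutil.chown(os.path.join(dirpath, name), user=user, group=group)

# e.g.: chown_restored_tree("/home/krinkle/restore2", "krinkle", "krinkle")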
Terminated Jobs:
 JobId  Level    Files      Bytes   Status  Finished         Name
====================================================================
396417  Full     108,320    11.70 G OK      13-Dec-21 09:34  graphite1004.eqiad.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily
396418  Full     108,320    11.70 G OK      13-Dec-21 09:35  graphite2003.codfw.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily
Running Jobs:
Console connected using TLS at 13-Dec-21 09:20
 JobId  Type Level    Files     Bytes  Name                                                                      Status
======================================================================
396417  Back Full     4,568   412.9 M  graphite1004.eqiad.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily  is running
396418  Back Full         0         0  graphite2003.codfw.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily  is running
====
Dec 10 2021
Let me give it a deeper look. While the patch by itself looks good as is, I want to check whether a different (non-default) backup policy would be more advantageous in terms of frequency and space. :-)
db1102 is back up again, BTW (T296546).
Dec 9 2021
Sorry, when I said closed, I really meant deleted.dblist.
The sys schema (https://mariadb.com/kb/en/sys-schema/) should be on all WMF databases, and the admin user should have access to it on MW databases.
ops is used for the log and functions of the query killer; it should be on all MW databases (at least for now).
You should check against closed.dblist. I have been an advocate that, if those wikis were to be kept, they shouldn't be on s3 but on a hypothetical very small "s0" section (e.g. on a VM) to save resources.
Others will be relics of past maintenance jobs or of wikis created by mistake.
Dec 3 2021
@Majavah: A reminder that I will let you or the rest of the Cloud Services team mark its status appropriately at https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge#dbbackups-dashboard for your internal coordination.
The host is down and ready to be serviced. Please let us know whether the stick could be removed successfully, or if any other issue arises, as it may require Puppet memory adjustments before putting it back into production.
@Cmjohnson Today would be preferred, as I will be off on Monday and won't be able to take it down and bring it back up. I will shut down the server, and if it cannot be serviced today, we can do it any day on or after the 9th.
from: SYSTEMDTIMER
to: email@example.com
Dec 2 2021
But @Kormat will need to bring it back up the following day (or wait till the 9th for you).
Sadly, I won't be around on the 7th. There is no issue regarding the move (backups should have finished by that time, and IP changes should not affect backups), but either the date has to be moved, or someone will have to stop the servers for me. I can bring them back up on the 9th when I return (no big issue with those being down for an extended time), but I won't be able to shut them down on the 7th.
Unsubscribing, as I think I was added to this ticket by mistake. This needs Traffic/traffic-security expertise, and they have already triaged and are aware of the task.
Dec 1 2021
Ah, so I guess it's not in production; that leaves me less worried. Sorry, I searched for ms-fe and monotonic but couldn't find this ticket.
Reassigning to btullis, as he was the person who brought up the alias issue, so he can evaluate/review whether the patch is a good solution to his concerns or not (I don't use aliases).
Nov 30 2021
My only concern is that it may want the memory to be mirrored.
I think your intentions were good :-). I edited my comment: it started being supported in MariaDB 10.5, and once we can stop supporting older versions, we should use the new command, for the sake of consistency (primary/replica) :-D.
@Dinoguy1000 I don't think you should do that: "SHOW REPLICA STATUS;" is not a valid MariaDB command everywhere (yet; it started being accepted in 10.5), so "SHOW SLAVE STATUS" is the working command that has to be sent; we don't have a choice on that. <strike>I think you should file an upstream bug to MariaDB instead to ask to rename the command</strike> You can file a ticket to use SHOW REPLICA STATUS once we stop supporting <10.5, but as a separate ticket; otherwise those kinds of edits would be confusing (they are not names we can choose to change, yet, unlike other identifiers).
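For illustration only (this is not code we run anywhere, and pymysql plus the helper name are assumptions of mine), a minimal sketch of how a client could pick the statement based on the reported server version:

import pymysql

def replica_status(conn):
    # Choose the replication status statement by MariaDB server version:
    # SHOW REPLICA STATUS is only accepted from 10.5 onwards.
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SELECT VERSION() AS v")
        version = cur.fetchone()["v"]            # e.g. "10.4.22-MariaDB-log"
        major, minor = (int(x) for x in version.split("-")[0].split(".")[:2])
        statement = ("SHOW REPLICA STATUS"
                     if (major, minor) >= (10, 5)
                     else "SHOW SLAVE STATUS")
        cur.execute(statement)
        return cur.fetchone()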
Are you ok leaving this server as is, until the refresh happens in Q3?
Nov 26 2021
I checked grants on all mariadb::backup_source hosts (db[2097-2101,2139,2141].codfw.wmnet, db[1102,1116,1139-1140,1145,1150,1171].eqiad.wmnet) at the same time I did a rolling upgrade/restart. I think I kept them all in sync with production (as much as I could; s6, s7 and x1 have some custom ones), but I removed the wikiadmin@localhost grants when I found them, as I had copied the grants from the model I was told to follow. FYI.
I don't have the answer to that question, but whenever any of you have the servers and path(s), you can follow the instructions at https://wikitech.wikimedia.org/wiki/Bacula#Adding_a_new_client to send a preliminary backup proposal to Puppet, and I will help you merge it with the proper setup (e.g. schedule, day, etc.). I think it will be more useful to discuss the details over a patch :-).
Nov 25 2021
One more question, to finally decide between setting up weekly full backups or daily incrementals: do all files mostly change completely, or only a subset of them? Incrementals can only be done with file granularity (a file is backed up in full whenever its path or hash has changed), so if value.wsp changes every minute and there is only 1 file per value, we will do "weekly full only"; otherwise the daily incrementals may be preferred.
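To make the trade-off concrete, here is a toy sketch (not how Bacula itself is implemented) of what file-level granularity means for an incremental run:

import os

def files_to_backup_incremental(root, last_backup_ts):
    # File-level granularity: any file that changed at all since the last
    # backup is re-sent in full, never as a partial delta.
    changed = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_backup_ts:
                changed.append(path)
    return changed

If every value.wsp gets written every minute, the "changed" list is simply the full file list, and the incremental degenerates into a daily full; that is the case where "weekly full only" would be the cheaper schedule.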
Nov 24 2021
if it's higher than a percentage
Adding this here in case someone else was confused about how some of those values could shrink (e.g. compared to T63111#5782953); also, apparently some PKs had their range doubled by being converted to unsigned (e.g. wikidata.rcs).
The number of files is (within reason) a non-blocker for Bacula, as files are packaged into volumes. It is true that each file is stored as a MySQL record, but that should scale up to dozens of (US) billions, although recovery may be slow when rebuilding metadata.
Deployment went as expected, but now that I have thought about it a bit, I think btullis brought up a confusing status:
Nov 23 2021
it will give us greater flexibility if and when we want the dbstore* and db* configurations to diverge. Is that about right?
^What do you think, Data-Engineering people?
Feel free to mark them as expected
clouddb hosts, and in general servers that will never be used as application servers, should not have the production admin account; they are in a different realm. It is the same reason why they don't share the root password.
Nov 22 2021
I have no expectation that there would be valuable data to save for any purpose
Being so small, I can store it for a few years on long-term backups (it won't be there forever, so this isn't a valid archival method, just a longer retention than regular backups). Waiting for feedback.
I just got pinged on T296166#7519256 about stopping backups of this database (this is a non-issue for me). What I want to raise here is that the order of the pending cleanup tasks may not be ideal: if backups are stopped AND the database is deleted, there will be nothing to archive, as all existing backups will be purged in 3 months. Plus, generating a public export, if required, may also be more difficult after the db is removed.
Nov 20 2021
I find this whole thread curious; I feel like people (including me) were talking past each other. I am going to try to clarify where we are after giving all the comments a read :-).
Nov 19 2021
I double-checked, and the timestamp where the heavy memory usage growth started corresponds quite well with an increase in MySQL query rate: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=16&orgId=1&from=1637203009358&to=1637313590000&var-job=All&var-server=db1163&var-port=9104 at T296063, which was reverted at ~8:30, the same time when the explosive memory growth stopped and stabilized. Possibly both issues share the same root cause? Disclaimer: I didn't do any deep root-cause research.
Nov 18 2021
Sorry, I commented while Manuel already had.
Thanks for the clarification (I misunderstood it as if it might have been deployed already). Errors stopped (so far) at Nov 18, 2021 @ 04:29:01.232:
However, my response wasn't "P_S is the right solution" (I did mention logstash would be a preferred solution), but a pushback against "this (the patch) would make P_S useless" or "this would make tendril useless", as I believe those claims are not accurate; those would be unaffected. It won't improve P_S tracing, but it won't disrupt it either (it is not a blocker).
using only p_s we get the queries redacted
We have a template spreadsheet of some SRE-maintained datasets so we can keep track of their current properties and state. Would that be a useful tool for you to classify the datasets you maintain?
I don't know how hard it will be, as it may require complex code changes, but having a user, IP, or actor id available would be nice too, if reasonable (they used to be there, but were lost at some point).
The same goes for Tendril (which, of course, I am not advocating to keep): it also does the right thing when generating digests. You can see the same query structure here (a lock) with different constants and comments under the same md5sum:
Constants are digested too, not only query normalization/syntax/comments:
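To illustrate what "digested" means here (a toy sketch, not Tendril's or performance_schema's actual algorithm): comments and constants are normalized away before hashing, so structurally identical queries collapse into one md5sum:

import hashlib
import re

def digest(query):
    # Toy normalizer: strip comments and replace literals with '?' so
    # structurally identical queries share one digest.
    q = re.sub(r"/\*.*?\*/", "", query, flags=re.S)   # drop /* ... */ comments
    q = re.sub(r"'(?:[^'\\]|\\.)*'", "?", q)          # string literals
    q = re.sub(r"\b\d+\b", "?", q)                    # numeric literals
    q = re.sub(r"\s+", " ", q).strip().lower()
    return hashlib.md5(q.encode()).hexdigest()

# Both of these produce the same digest ("select get_lock(?, ?)"):
# digest("SELECT GET_LOCK('abc', 10) /* App::foo */")
# digest("SELECT GET_LOCK('xyz', 30) /* App::bar */")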
It would also make performance_schema useless and there is no way to actually make it work