Tue, Dec 5
Could it be https://jira.mariadb.org/browse/MDEV-28334 ?
This is now done (with further improvements down the line); I will now document it on wikitech.
Mon, Dec 4
@Marostegui (or anyone else) If you have the time, could I ask you for a quick "cli UI" review of the above patch? I have installed a test 0.3.0 package on ms-backup1002 only.
Updating title to reflect current request.
Fri, Dec 1
This is now ready for Puppet 7 and the new CA; the only pending thing is to add it to the configuration.
Wed, Nov 29
Sorry for the terrible suggestion, but how difficult would it be to change the gitlab parser to parse Bug: footers differently and link them to phabricator/render them nicely on gitlab? I think it would be important to know whether that is something that could potentially happen in the future or not. My guess is that it is a standard inherited from bugzilla times that at some point was implemented in gerrit (?).
Tue, Nov 28
Mon, Nov 27
Thank you, and sorry for the urgency. Normally these kinds of packages always keep backwards compatibility (and they did here too) and only have optional improvements, but this was something of an emergency due to the puppet 7 upgrade.
I'm afraid you don't have the latest version (see https://debmonitor.wikimedia.org/packages/python3-wmfbackups); you should upgrade to 0.8.3+deb12u2.
Thu, Nov 23
We can retrieve disk utilization for the last year from prometheus:
Wed, Nov 22
/var/lib/gerrit2/review_site/cache sounds like a cache. Is it really necessary for a site recovery, or would the cache get regenerated? I am not opposed to backing it up anyway, just in case, but maybe not hourly? Obviously, I also speak from a place of ignorance (any change should be tested), but I would like to know whether the hourly backups are really being effective, given their size (and how they used to be much smaller).
This is now deployed and media-backups schema is up to date. Media backups are flowing as usual. I am no longer a blocker here.
Tue, Nov 21
All work is now done here; we are only waiting for the last snapshots to finish before resolving this issue.
I have unclogged things from my side (bacula) and backups are now flowing normally. Usually I would close this as resolved, but I am leaving it to you in case you want to check something else from the client side (gerrit1003). Please reassign to whoever is appropriate, especially regarding my questions at https://phabricator.wikimedia.org/T351725#9350316
New filesystem order arrived! https://grafana.wikimedia.org/goto/2I_2H9IIz?orgId=1
Despite https://gerrit.wikimedia.org/r/c/operations/puppet/+/976286 , I wonder if backups are being executed well: the full backup is around 50GB, but right now 20GB are copied each hour. I wonder if that makes sense, whether 20GB of new files are really generated each hour or there is some avoidable overhead, especially given that a few months ago it was 8GB, and I would expect gerrit usage to go stale as people start using gitlab more.
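To put those figures side by side, here is a rough back-of-the-envelope sketch. It only uses the numbers quoted above (50GB full, 20GB per hour now, 8GB per hour before) and assumes, purely for illustration, that every hourly run copies about the same amount and that 24 runs happen per day; the function name is hypothetical.

```python
# Figures quoted in the comment above, in GB.
FULL_BACKUP_GB = 50
HOURLY_NOW_GB = 20
HOURLY_BEFORE_GB = 8

# Assumption for illustration: one run per hour, all runs similar in size.
RUNS_PER_DAY = 24


def daily_volume_gb(hourly_gb, runs_per_day=RUNS_PER_DAY):
    """Total data copied per day if each hourly run copies hourly_gb."""
    return hourly_gb * runs_per_day


if __name__ == "__main__":
    now = daily_volume_gb(HOURLY_NOW_GB)
    before = daily_volume_gb(HOURLY_BEFORE_GB)
    print(f"daily volume now:    {now} GB")
    print(f"daily volume before: {before} GB")
    # How many full backups' worth of data is copied per day by hourly runs.
    print(f"full-backup equivalents per day now: {now / FULL_BACKUP_GB}")
```

Under those assumptions the hourly runs would copy several times the size of a full backup every day, which is why the question of avoidable overhead seems worth checking.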
"Job ... is waiting. Cannot find any appendable volumes"
Metadata from yesterday's backups are starting to pour in:
BTullis closed this task as Resolved.
I believe I have fixed the issue; puppet was wrong after the package fix at T351491.
Fri, Nov 17
This should now be fixed with the new packages + puppet code (but we will see over the weekend).
I believe the work on those tickets was done here, CC @fnegri. But please double check whether there are additional dbs that were not part of the migration.
Please note that an important update of the wmfbackups package for compatibility with Puppet 7 will be pushed soon (wmfbackups 0.8.3 and its related subpackages, T351491). It should not affect cloud usage, as you don't use the dbbackups stats gathering, but you should upgrade ASAP to avoid future errors.
Tue, Nov 14
Volans, based on your comments at T330882, the original scope I needed for transfer.py is already fulfilled. I am guessing this is just open to "improve it" beyond that, in the latest, increased scope?
Mon, Nov 13
Nov 10 2023
Indeed, the same schema change for production has to be applied to backup metadata, as we mirrored the size from mediawiki as an unsigned int:
Thank you, @AlexisJazz, that's useful feedback that without doubt will make our media storage happy. Still, there are additional technical operations and challenges to overcome. Cost is not so much the concern (especially for enwiki needs); we are more concerned about Commons with its almost half a PB of storage. But servers still have to be purchased, racked and installed, data resharded, and everything planned, and that takes some time. It is not a question of just "buying larger disks". :-D
@AlexisJazz While we are happy that you are excited about this, this is by far not ready for discussion. Developers have just handed over the code, but it still requires a lot of preparation and discussion before system administrators can implement it at WMF, due to the scale of operations. There are a lot of open questions regarding the extra Swift space needed, backup compatibility, schema change deployment, and much other work needed.
Nov 9 2023
Feel free to contact other SREs who can support you (they can be those in data engineering, as they may know more about Hadoop); they can get back to me directly too, if that helps.
Please keep me in the loop on the progress. While this doesn't affect production, I may have assumed in some cases that files were always smaller than 4 GB for backups, and I may need to review their storage compatibility, even if it is just applying the same schema change to the backup metadata.
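For context on where the 4 GB figure comes from, here is a minimal sketch, assuming the size is stored in a 32-bit unsigned integer column; the helper name is hypothetical and not part of any backup tooling.

```python
# Largest value a 32-bit unsigned integer can hold: 4294967295 bytes,
# which is one byte short of 4 GiB.
UINT32_MAX = 2**32 - 1


def fits_in_uint32(size_bytes: int) -> bool:
    """Return True if a file size in bytes fits in a 32-bit unsigned int."""
    return 0 <= size_bytes <= UINT32_MAX


if __name__ == "__main__":
    gib = 1024**3
    for size in (3 * gib, 4 * gib, 5 * gib):
        print(size, fits_in_uint32(size))
```

Any file of 4 GiB or more would not fit in such a column, so metadata that mirrors the size would need a wider type for those files.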
Nov 8 2023
While checking the things I need to apply the change, I found I need 2 additional data points:
One last thing- legal rarely comments on stuff here in public on Phab- you may want to reach them directly.
Ok. Then it seems we are ok for the most part. I will then start working on access; as this is the first time such access has been requested, please be patient. It won't be as fast as a regular access request, but it shouldn't be too hard either. I will take the opportunity to document the grant creation in case there are other requests in the future.
Thanks a lot, worker.reporter was exactly what I think @MatthewVernon wanted (ignoring the logging done by the executor and rolling our own logging of what counts as an error), not ok_codes. I've updated the ticket to reflect the request more accurately.
Nov 7 2023
Looks good to me!
It can be used for production for now, temporarily, but I will eventually need it for testing backups. Sadly, testing is not at the top of our priorities, but the idea was to finally use it (or a replacement) for that this fiscal year.
Nov 6 2023
After much struggle, reimage worked correctly.
Ah, I see the issue:
the given socket does not have a known format
Nov 2 2023
One additional request: the last week of the year is WMF holidays and I was scheduled for clinic duty. I still want to do my part, so could you override that (nobody is supposed to be around during Christmas week) and add me at the end of the schedule? Or, if that is too much of a change, I can just be the person to ping if someone cannot do their clinic duty turn, or help the person after Christmas, whatever you see fit.
Adding Arnaud so he can add this ticket to his reading list, for context on previous conversations about transfer tooling needs, but requiring nothing else.
Yeah, there are already tickets open for the pending issues (logs, arguments, etc.).
It happened again, bumping priority.
Oct 31 2023
We should discuss this a bit- as this changes not only the initial hypothesis, but also the restrictions of your project:
reprepro changes:
add bookworm-wikimedia deb main amd64 wmfbackups 0.8.3+deb12u1 -- pool/main/w/wmfbackups/wmfbackups_0.8.3+deb12u1_amd64.deb
add bookworm-wikimedia deb main amd64 python3-wmfbackups 0.8.3+deb12u1 -- pool/main/w/wmfbackups/python3-wmfbackups_0.8.3+deb12u1_amd64.deb
add bookworm-wikimedia deb main amd64 wmfbackups-check 0.8.3+deb12u1 -- pool/main/w/wmfbackups/wmfbackups-check_0.8.3+deb12u1_amd64.deb
add bookworm-wikimedia deb main amd64 wmfbackups-remote 0.8.3+deb12u1 -- pool/main/w/wmfbackups/wmfbackups-remote_0.8.3+deb12u1_amd64.deb
@Ladsgroup Yeah, I thought so. Let's both try to come up with a strategy that is relatively simple to implement in automation, and we can discuss it next Monday, to either reduce granularity or archive (for which there is currently no mechanism in place).
Oct 30 2023
For bacula it will be a bit more complicated, as during the first week there is usually more load. So maybe, if it can be done tomorrow, wait until 11 am UTC (or a week later). I can prepare the patches by then.
Tomorrow (or if you need more time, one week later) would be a good day: backups will have run the night before (they usually finish at 5:20 UTC) and you are free to do any maintenance there.
I was planning on creating a package for bookworm soon, but I cannot provide any timeline.
Oct 24 2023
As a followup, consider documenting the special grants in puppet in the future. We don't have a good solution to monitor and assign them, but at least for now we document them at this location: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/templates/mariadb/grants/analytics-replica.sql