I know- it is only related because the wikidata migration requires moving replication channels, and that consumes DBA time, not because it contains wikidata.
dplbot/s51290 seems to keep creating issues. In this case, the lag was caused by a different issue (another user creating heavy queries, not dplbot itself), but it was making it difficult for replication to catch up right now. I have banned s51290 from labsdb1001 (not from the other replicas, which can still be used, nor from other database hosts) until the lag goes back to 0 or close. Once I did that, replication lag started decreasing.
p50380g50440 was running several queries that were never going to stop executing, causing 1 day of lag on labsdb1001:
@hoo Regarding the Wikimedia setup, you should know that our priority right now is to move wikidata to a dedicated server group, which means that, on the ops side, no other structural change can happen at the same time.
Codfw version of T134476
Let's consider this fixed and let's focus on T175685.
I think after the above patch, only the proxies are missing?
Will do! Thanks. Please give me a heads up if any maintenance happens here; unless you tell me otherwise, I will put it back into production. We can put it down at any time later, but I do not want it down for long (replication keeps moving forward :-); I just need to depool it beforehand.
The error in the description was from the lifecycle log. It gave the same description as the one you googled.
Mon, Sep 18
And put it down. CC @Cmjohnson.
there will be read queries here against vslow slaves that are very long (on the order of an hour)
db1100 is depooled; I have downtimed it for a week so the BIOS update can happen at any time.
There was no lag on the last occurrence of that error:
Ok, then this is not a blocker for the above ticket. We would appreciate a ping when the actual database is deployed.
Is this going to be a public or a private wiki?
Sun, Sep 17
Not sure if with "you", you mean me, but if it is safe, yes. We may have to defragment the table afterwards to reclaim disk space, but that can be done later and it is not a blocker.
We identify and delete the duplicate rows (not trivial, but not difficult either), then we add a UNIQUE constraint over that combination of columns so that it never happens again.
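For illustration only (the table and column names below are hypothetical, not the actual schema), the dedup-then-constrain step could look roughly like this:

```sql
-- Hypothetical table/columns; illustrative sketch, not the real migration.
-- 1) Remove duplicates, keeping the row with the lowest id per (col_a, col_b).
DELETE t1 FROM some_table t1
JOIN some_table t2
  ON t1.col_a = t2.col_a
 AND t1.col_b = t2.col_b
 AND t1.id > t2.id;

-- 2) Add a UNIQUE key over the same columns so duplicates cannot reappear.
ALTER TABLE some_table
  ADD UNIQUE KEY uniq_col_a_col_b (col_a, col_b);
```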
Sat, Sep 16
We need this ASAP: dbstore1001 crashed and it is not in a good state; plus, it can no longer catch up with replication reasonably well.
@Jayprakash12345 Unless I am wrong, that is a different issue, not related to the analytics db servers- please file a separate ticket so @Analytics ops can have a look at it (it is probably not related to mysql).
Please tell me your timezone or roughly the hours when you will be available so we can find the best match. We do not need to do it at that exact time, as long as there are admins or developers around.
Fri, Sep 15
@Etonkovidova Can you provide more information about how you plan to use that data? It was not initially included in the analytics data because it was both more technologically difficult and not needed at the time. If you need one-time access, I can quickly provide you temporary access to a host. If it is an ongoing (long-term) project, I will import those tables into dbstore1002, which may take a few days.
If it is enwiki, then this can go at any time someone is around in case of a problem. I propose 15:00 UTC on Monday, as admins and developers from both Europe and the Americas should be around (me in particular), but that is only because I do not know your schedule- feel free to propose a different time if you or any other "renamer" prefers it. Please let me know :-)
Which wikis was he mainly editing? We have some ongoing performance issues on wikidata and commons, and I would like to delay it if it impacts those.
@MusikAnimal So this is a couple of things. MySQL here may be confusing, but it is not the one at fault: it is executing the queries in the fastest way possible, and the rev_id used has little effect on performance (although, indirectly, there could be a correlation between high and low numbers due to edit patterns); it is not the issue here.
@IKhitron revision_userindex is a made-up view that exists on the cloud wikireplicas only- production has very similar content, but different objects in the database (columns, indexes, etc.), so it needs a somewhat separate assessment.
Low after being depooled.
However, it is quite a coincidence that it crashed just hours after being pooled and getting some load: https://gerrit.wikimedia.org/r/378003 (it had been idle for weeks before). I would like to generate some cpu load to make sure this isn't repeatable.
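If it helps, one crude way to generate CPU load from inside MariaDB itself (just an assumed approach; stressing the host directly would work too) is something like:

```sql
-- Burns CPU inside mysqld for a while; run several in parallel to load more cores.
SELECT BENCHMARK(500000000, SHA1('cpu load test'));
```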
Thanks, I just wanted to doublecheck.
Making it NDA-only for now, based on extreme, paranoid-level caution, until @MoritzMuehlenhoff or @Cmjohnson consider whether we should worry about https://en.wikipedia.org/wiki/Intel_Active_Management_Technology#Known_vulnerabilities_and_exploits
Thu, Sep 14
Decomm. done; the only references left are as spare in site.pp and admin_install.
Are you OK with all mariadb:: roles?
Wed, Sep 13
Let's wait a bit more. I may have to talk to you about setting up TLS for php and changing passwords; let's talk and aim for next week (but we shouldn't delay it much).
I think the assignment is an accident because it was created as a subticket of another ticket; nothing for you to do here yet. Sorry for the distraction.
I believe this, or something similar in spirit, could be happening now for s5. I need to look into it more to identify it.
Actually, I converted the tables to InnoDB already- so nothing is to be done unless there is still some non-obvious interaction that makes the lag not go away (metadata locking, inter-query locking, or something else). In theory nothing needs to be done until we check that the lag doesn't happen again. This was just a heads up to never create Aria tables, ever! :-D
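For reference only (the table name is hypothetical; the real tables were already converted), the conversion plus a follow-up check could look like:

```sql
-- Hypothetical table name; convert it from Aria to InnoDB.
ALTER TABLE some_aria_table ENGINE=InnoDB;

-- Verify that no Aria tables remain in the current schema.
SELECT table_name, engine
FROM information_schema.tables
WHERE table_schema = DATABASE() AND engine = 'Aria';
```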
Is it possible to see which script is running the query mentioned in your comment?
Missing word: "reads will get [blocked] by writes (from replication) [on non transactional engines]."
db1048 is now ready to be decommissioned; it is set as spare, but it still needs to be fully deleted from the configuration and infrastructure (installer, site.pp).
Actually, I am not sure if it is that script; there is another one happening at 3am, too, that seems to block the queries:
@mmodell We have to upgrade the hardware for the phabricator databases. What do you think of also doing, this Thursday, a master switchover and an upgrade to stretch/mariadb 10.1, enabling TLS and setting up the firewall? It should only take a few seconds of restarting phabricator to pick up the new connections; if something goes bad, we revert to the current server.
Not directly related, but for background, and could be relevant regarding recentchanges scanning and why it has become a problem for many queries lately: T171027#3599821
Tue, Sep 12
0 is Online, Spun UP. Next one should be Span: 1
I am not saying this is not important, but nothing will break if this is not done (unlike many other "normal" tickets), which for me makes it low priority. Feel free to put it higher if you can help with this.
See comment on gerrit, it helps with speeding up reviews :-).
db1069 has been reused on s7; we should probably choose db1066 instead.
Still on "Firmware state: Rebuild"; we will wait a bit for the next one. (I am being a bit more cautious than I have to be with the RAID 10 because the disks are not new, so there is a chance for those to fail, too.)
After testing some indexes, I do not see a huge improvement- we can reduce from scanning 100M rows to 18M, but there can always be a combination of query parameters that does not filter many rows on recentchanges. Paging by id (or timestamp) is the only reliable solution to run the queries in smaller batches so they do not fail:
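As a rough illustration only (the bounds, columns, and batch size here are made up, not the actual failing query), id-based paging would look something like:

```sql
-- Illustrative sketch: page through recentchanges in small batches by rc_id,
-- so each query touches a bounded number of rows instead of scanning millions.
SELECT rc_id, rc_timestamp, rc_title
FROM recentchanges
WHERE rc_id > 123456789   -- last rc_id seen in the previous batch
ORDER BY rc_id
LIMIT 5000;
```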