User Details
- User Since
- May 11 2015, 8:31 AM (578 w, 4 d)
- Availability
- Available
- IRC Nick
- jynus
- LDAP User
- Jcrespo
- MediaWiki User
- JCrespo (WMF)
Wed, Jun 10
I will try on Friday or ASAP.
This is resolved, decom tasks will be done on the parent task.
Re: backup2013, it needs no special treatment other than downtime. It has no issue with a temporary network maintenance unless it gets extended for a few days, as backups usually run during UTC night.
Mon, Jun 8
I have moved the primary function (except the backups) of db2183 to db2184 (T428467) so db2183 can lose connectivity for an extended period of time for this task. CC @FCeratto-WMF No further action is needed except downtime/remove downtime of db2183,db2198,db2198 before and after maintenance.
Done, db2184 is now the primary and db2183 is the replica/candidate, and now can lose network without issues for the parent task.
Tendril updated. heartbeat table cleaned up, orchestrator looks good.
Thu, Jun 4
Backups are now correctly configured. I would like to propose some ownership changes, though, at the next Data Persistence meeting to avoid issues like this in the future.
I don't think anyone is disputing that orthophotos can be educationally useful. The question is whether storing massive numbers of huge TIFFs on Commons is the best use of Commons infrastructure and budget.
Waiting for confirmation that database is included in a new backup run before resolving. Meanwhile, I will audit any other db that could have been created without requested backups configured.
Backups were not set up.
Tue, Jun 2
I tested remote backups, and packages seem to be in a working state, but cumin (a dependency) seems to not be working well or to be lacking extra setup. No worries, because I don't think I have any actionable item on my side that would block this, but please do not remove cumin2002 yet because of it (nor am I in a hurry for the upgrade).
Fri, May 29
After the issues encountered with bacula versioning, the most important/difficult blockers are gone, now only waiting on generating extra backups before removing the old host with the old ones.
Thu, May 28
We need to reimage trixie host backup2014 back to bookworm, as all backup hosts (storages and directors) need to be on the same bacula version, or backups will error out with "3913 Bad use command".
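The same-version requirement above could be checked before a backup run with something like the following. This is only a minimal sketch: the host names and version strings are placeholders (not the real inventory or the actual packaged bacula versions), and `find_version_outliers` is a hypothetical helper, not part of bacula or of any existing tooling.

```python
from collections import Counter
from typing import Dict


def find_version_outliers(versions: Dict[str, str]) -> Dict[str, str]:
    """Return the hosts whose reported version differs from the most common one."""
    if not versions:
        return {}
    # The most frequently reported version is treated as the expected one.
    majority, _count = Counter(versions.values()).most_common(1)[0]
    return {host: ver for host, ver in versions.items() if ver != majority}


# Placeholder data: hypothetical host names and made-up version strings.
reported = {
    "example-director": "1.0",
    "example-storage-a": "1.0",
    "example-storage-b": "2.0",
}
outliers = find_version_outliers(reported)
print(outliers)
```

How the versions are collected from each host is left out on purpose, as that depends on the local tooling.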
I have only one suggestion regarding the ticket name and the framing (not the project itself, which looks to me like a great idea): I would pitch it as "having a dedicated cluster for thumbs only", which is what the description seems to imply. We may have limited resources, and the cluster may have to share resources with the existing text one for the reasons stated, but the different domain still establishes a "logical" separation. Framed that way: 1) I wouldn't reject being able to get additional resources so early in the timeline for the main cluster, and 2) it may be more attractive and relatively accurate as a summary despite later trade-offs, as it stresses the need to separate thumbs from originals rather than "moving" it. I would do the same for swift, pitching it as a separation, even if due to resource constraints it may end up sharing the same physical cluster hardware at implementation time. Feel free to disagree with my opinion.
Verification has finished, all files transferred ok. I found some metadata issues while doing the checking, but those do not block the decommissioning of the old hosts.
Wed, May 27
db2183 will require stopping mediabackups in advance, to prevent losing metadata. I will take care of that.
Thanks for the heads up, @Marostegui
Mon, May 25
Thanks, then I will start the depool at 11:30 UTC in order to minimize backup lag.
Fri, May 22
I am not seeing any 429 from this source in the last 15 days, so tentatively resolving. Please reopen if you disagree.
Wed, May 20
Percona Toolkit, which this package is a fork of, doesn't depend on that for trixie: https://packages.debian.org/trixie/percona-toolkit so I don't know for sure, but my guess is it could be ignored on the package, although it could use some update + repackage just in case.
Tue, May 19
Thu, May 14
Hi, once an issue is resolved, discussions on it are expected to be kept to a minimum unless there is a chance to reopen the ticket because it hasn't been properly solved.
The only thing I did was purging following the instructions of Manual:Purge; I didn't do anything that anyone else can't do on their own.
Purging is a mediawiki thing, so technically it doesn't work with files, but it is my understanding that purging certain pages also purges the files referenced by them. Don't worry, it sometimes gets complex, and sometimes you have bad luck - e.g. if a partial outage or error happens during normal purging. Even if 0.0000001% of files get affected, that means many dozens of them happen to have that issue, and as long as it is reversible it is something we have to live with (making cache purging ultra-reliable would have a very large cost).
@ayounsi not urgent, but please ping me with a time of start of maintenance when you have it so I can do what I mention above in advance.
I did a few manual purges, can you check if that helped? I removed the Swift and commons tags because bypassing cache shows me the right files (I believe) on both datacenters.
Hi, @Jcubic thanks for the report. On upload of a new version, caches are normally purged from our content delivery network; however, how much time it takes for that to propagate depends on different causes, from our servers to your browser. If that happens once to you, please do as @Aklapper mentions above and reload your page after clearing your browser cache or, if it still doesn't get fixed, you can also manually purge the cache from the page (https://www.mediawiki.org/wiki/Manual:Purge). It is not impossible that, out of the millions of purges happening on millions of pages every day on all of our datacenters, one could fail or be delayed temporarily.
Wed, May 13
backup2015 is part of the media backup hosts, I can stop media backups for codfw before the maintenance on my own.
May 8 2026
I sent an email to Grant with additional data I can share, as allowed by staff SREs.
We are working on this for production network backups, feel free to talk to me if this gets prioritized in the future.
Hey, @CWilliams-WMF - this is not a requirement, but it would be quite useful for you to link your phab account with your LDAP one at: https://phabricator.wikimedia.org/settings/user/CWilliams-WMF/page/external/ It helps people handle certain requests faster and find you faster. :-)
May 7 2026
May 4 2026
All in favor of installing any tool necessary for profiling/debugging a specific issue, absolutely no problems there, but I would talk to the Observability team about the gaps in observability in the long term, as existing tools are supposed to cover all essential needs regarding basic monitoring, and standardization would help everyone and prevent multiple tools doing the same thing.
Maybe it is related to grants, a.k.a. it fails for the icinga user (?).
BTW, the warning is also @ es1045 (3 in total).
Apr 29 2026
Apr 28 2026
db2141 is done on our side and sent to dcops, after the backups on the replacement were tested.
This is ready for dc ops.
How challenging would it be to restore to a different host than the one from which the backup was taken?
I thought my puppet code had a bug caused by a refresh not being enough to reload the tls configuration, but that wasn't the issue, the automatic refresh from puppet was enough; thus the problem I had above was due to the long-running client requests, not the server. Nevertheless, I did a full restart of the service for testing and then verified it all got the discovery2026 cert (checked with openssl for all open ports). All good on my side.
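A check like the openssl one above could also be scripted. The following is a minimal sketch, assuming the new certificate can be recognised by its issuer common name (a guess for illustration; a serial number or expiry date may be a better discriminator). The helper names are hypothetical, and the default context verifies against the system trust store, so an internal CA would need to be loaded with `context.load_verify_locations(...)` first.

```python
import socket
import ssl
from typing import Dict, List, Optional


def fetch_peer_cert(host: str, port: int, timeout: float = 5.0) -> dict:
    """Open a TLS connection and return the validated peer certificate as a dict."""
    context = ssl.create_default_context()  # system trust store; see note above
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def issuer_common_name(cert: dict) -> Optional[str]:
    """Extract the issuer commonName from a getpeercert()-style dict, if present."""
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def ports_with_unexpected_issuer(
    certs_by_port: Dict[int, dict], expected_issuer: str
) -> List[int]:
    """Return the ports whose certificate was not issued by expected_issuer."""
    return sorted(
        port
        for port, cert in certs_by_port.items()
        if issuer_common_name(cert) != expected_issuer
    )
```

In use, one would call `fetch_peer_cert` for every open port of the service, collect the results in a dict keyed by port, and pass that to `ports_with_unexpected_issuer`; an empty list means every port serves the expected certificate.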
Apr 27 2026
I had a look when D.P. was added, but probably @Eevans should be the primary contact for cassandra related changes.
@Scott_French would it be possible to schedule with someone in your team a recovery test, around 2 weeks to 1 month from now, to double check we are able to recover data as needed, and make sure we are backing up everything we need? I think that would be my only pre-requisite for resolving this (scheduling the test).
Resolving unless issues are seen after deployment; reopen if anything happens.
Apr 24 2026
This is now tested, will do the archiving in another task as part of decommissioning backup1003/backup2003 to prevent duplicate work.
FYI @Marostegui This is almost ready to proceed, but waiting on backups check until Tuesday to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1276407 and proceed.
Apr 23 2026
Apr 21 2026
New backups worked, they went from almost 10h and 2.2TB to 8 minutes and 23GB.
Apr 20 2026
While all the data I have access to is already anonymous, it is still users' private data; the osm wiki is just the referrer. Let me ask what parts (if any) I can disclose to people without an NDA.
Apr 17 2026
I am then changing the tags to reflect that this is not something we own. Whether we want to remove it or fix it is a different matter, but someone who uses it should come forward and say how/why it is used.
CC @Marostegui just FYI this is what created some lag on m1 (it may happen again in the future), but it will allow removing some garbage from the db instead. You can unsub from the ticket after you read this or disable notifications.
On a trixie system I had an issue in which the host decided to kill big processes rather than becoming slow (something related to stall detection, not due to VM/swapping), as the debian default assumed more of a k8s-style management. Modifying a kernel parameter worked for me, in case that is relevant.
