User Details
- User Since
- Oct 3 2014, 5:57 AM (499 w, 1 d)
- Availability
- Available
- LDAP User
- Giuseppe Lavagetto
- MediaWiki User
GLavagetto (WMF)
Mar 25 2024
I don't think httpbb tests should really break deployment, but rather ask the deployer for confirmation that they intend to continue in case of issues.
Every time I log into the victorops portal I'm shown reCaptcha, and it's the only site on the internet that does, so it's not about the reputation of my IP/browser profile.
Mar 24 2024
Mar 23 2024
Mar 21 2024
There are a few reasons why we didn't migrate changeprop to use the service mesh; first of all, we don't want to define timeouts outside of it.
Mar 8 2024
@tstarling I think we determined that the expensive part of handling large files in shellbox was mostly the download/hash verification of the received file; I don't even think we strictly need PUT support.
Mar 7 2024
Mar 6 2024
Hi @Eevans, I'm a bit perplexed by why you think serviceops should be able to assist with this issue. This seems like an application bug triggered by external traffic, from the looks of it.
Mar 5 2024
@Scott_French great job; indeed, I think we can just go forward and do it, even if there are a few things that look suboptimal.
Mar 4 2024
Re-uploaded the packages to the right components.
Mar 2 2024
Mar 1 2024
Apologies, I must have made a mistake. I will fix it as there are other issues as well.
Feb 29 2024
When we're talking about errors, it's always a good idea to reason in terms of error ratios and not error rates.
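To illustrate the distinction with a quick sketch (the function name and the numbers are mine, purely illustrative):

```python
# Illustrative numbers, not real metrics: the same absolute error rate
# (errors/second) means very different things at different traffic levels.
def error_ratio(errors_per_sec: float, requests_per_sec: float) -> float:
    """Fraction of requests that fail, independent of traffic volume."""
    return errors_per_sec / requests_per_sec

# 10 errors/s at peak traffic vs. 10 errors/s at night:
peak = error_ratio(errors_per_sec=10, requests_per_sec=10_000)  # 0.1% of requests
night = error_ratio(errors_per_sec=10, requests_per_sec=200)    # 5% of requests

print(f"peak: {peak:.1%}, night: {night:.1%}")
```

The error rate is identical in both cases, but the ratio shows the night-time situation is fifty times worse for users.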
In the meantime, I remembered why we were only replicating the /conftool keyspace:
Feb 28 2024
I don't think this is a data persistence issue, but rather it's much more probable this is actually a restbase bug. At the very least, investigation should start there.
The file has been removed from the puppet doc tree, so at least that's not a problem anymore.
I also want to note that on kubernetes memory is mostly managed by the k8s scheduler on top of the kernel's: containers (which are nothing more than cgroups) get OOM-killed before their memory use can overflow, so we never overcommit. The k8s scheduler also takes care of never allocating the portion of memory we reserve for system components, which we currently do.
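As a sketch of the per-container contract the scheduler enforces (values are illustrative, not our actual settings):

```yaml
# Illustrative values only. The scheduler places pods based on "requests";
# the cgroup OOM-kills a container that exceeds its memory "limits".
resources:
  requests:
    memory: "512Mi"   # guaranteed amount, used for scheduling decisions
    cpu: "500m"
  limits:
    memory: "1Gi"     # hard cap: exceeding this OOM-kills the container
```

The reservation for system components corresponds to the kubelet's `--system-reserved`/`--kube-reserved` settings, which subtract capacity from what the scheduler considers allocatable on each node.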
Focusing on the swap part of the problem, for posterity:
Feb 23 2024
Before we move into finding solutions, I'd like to understand better what is the goal we want to accomplish:
- If the goal is to make what we upload to packagist cleaner, composer.json has a way to do it (archive.exclude, IIRC?)
- If we prefer not to have pygmentize in the repository (why?), we can adapt the image build process to fetch it remotely
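For reference, the archive exclusion mechanism mentioned in the first point would look something like this in composer.json (the paths are hypothetical, just to show the shape):

```json
{
    "archive": {
        "exclude": ["/tests", "/pygmentize"]
    }
}
```

Paths listed there are left out of the archives composer/packagist distribute, while remaining in the git repository.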
This task was about HHVM-specific issues. Feel free to reopen if you think it's still valid.
Feb 22 2024
@tstarling just removing the .pipelinelib directory isn't an option if we want to keep using shellbox in production. I don't see a good solution for that other than maintaining separate branches and backporting changes from the main branch to a wmf branch, which would come with its own inconveniences, of course.
Feb 21 2024
We still need to investigate/fix the regression on CLI for DjVuImage, see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/720143/comments/8f625c32_80c745cd
Feb 20 2024
I built and uploaded the packages to apt, which means that next week we should automatically roll them out to kubernetes in the weekly rebuild of base images.
Feb 16 2024
Feb 12 2024
@Tacsipacsi I would ask you to keep emotions in check, it's very hard to collaborate (which is what we are supposed to be doing here) when someone is that confrontational. Specifically:
- yelling doesn't get your point through better, quite the opposite. In fact, it took me quite a while to understand what you were trying to say.
- raising tasks to UBN! because you're in disagreement also won't get your point through better.
Feb 8 2024
GitLab needs regular maintenance windows, at least once a month, if not more often, and they usually last around 15 minutes.
Feb 7 2024
My spot tests show transcodes now work on testwiki. I'll let @brion take a look as well before I declare this bug resolved.
Feb 6 2024
While the original issue is solved, we're running into a new one - now the script exits with an OOM, and unless I'm reading it incorrectly, the memory limit seems to be just 4 MB which would explain the problem.
Hah, I found the issue.
That is not the problem, or not the only one for what it's worth; I've tested encoding an audio file and I get the same error, and there is no TMH_OPT_VIDEOCODEC set there (see reqId in Logstash).
I think the '\\''--env=TMH_OPT_VIDEOCODEC='\\''\\'\\'''\\''vp9'\\''\\'\\'''\\'''\\'' '\\'' looks quite suspicious here.
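As a side note on how strings like that come about: each extra layer of shell quoting wraps the previous one, and one shell parse only undoes one layer. A minimal sketch (the env var name mirrors the log; the reconstruction itself is hypothetical):

```python
import shlex

# Hypothetical reconstruction of the argument as the job meant to pass it:
arg = "--env=TMH_OPT_VIDEOCODEC='vp9'"

once = shlex.quote(arg)    # one layer of quoting: correct
twice = shlex.quote(once)  # quoting the already-quoted string: the bug pattern

# One shell parse undoes exactly one layer, so a double-quoted argument
# reaches the command still wrapped in quote noise:
print(shlex.split(once))   # the intended argument
print(shlex.split(twice))  # still carries a layer of quoting
```

Every extra, unmatched quoting pass multiplies the `'\''` escapes, which is exactly the pattern in the string above.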
Jan 31 2024
Given the chosen size is both non-standard (meaning it's not used on most large wikis) and not in the list of thumbnail sizes we pregenerate at upload time, I would imagine switching enwiki to this new thumbnail size would have a big impact on both the upload edge clusters and the backend object storage.
Low priority as we've found out the problem is fundamentally deeper than we'd like.
Jan 30 2024
Just stating for the record that connection refused/reset messages will come from our edge caching layer, specifically from the TCP stack of our servers there, so it wouldn't be related to a migration to kubernetes (which is still only partial, btw).
Jan 25 2024
It generally seems ok, but a few considerations:
- kafka-main is much smaller than kafka-jumbo, and critical to site operations
- The codfw.mediawiki.cirrussearch.page_rerender.v1 topic is pretty large at the moment, 292 GB in codfw and 149 GB in eqiad, while the corresponding eqiad topic is, as expected, tiny/irrelevant.
Jan 23 2024
Please let's make apt-merge less cumbersome to use.
Jan 18 2024
Adding @brion as the resident expert / maintainer of TimedMediaHandler. I'd like to get your opinion on how hard it would be to port WebVideoTranscodeJob to use shellbox :)
EDIT: It looks like the metamoderation script actually spawns jobs; interestingly, all errors seem to come from the same reqId, 8e1438850af4ec4c4b82ebb6, and clearly they're not ThumbnailRenderer jobs.
Jan 16 2024
Jan 11 2024
Jan 8 2024
It seems to me that trying to respond to 1k rps with a concurrency of 2 is probably the issue. Throttling is bad because it raises latencies; if any measure we take to avoid throttling increases latencies compared to throttling, why bother?
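To put numbers on that (back-of-the-envelope, assuming steady state and Little's law):

```python
# Little's law: concurrency = arrival_rate * mean_latency.
# With 2 worker slots serving 1000 req/s, the mean service time must be
# at most 2 ms, or requests pile up and get throttled.
rate_rps = 1000.0
concurrency = 2
max_mean_latency_s = concurrency / rate_rps

print(f"max sustainable mean latency: {max_mean_latency_s * 1000:.1f} ms")
```

Any realistic service time above that bound means the queue grows without limit, so some form of latency increase (queueing or throttling) is unavoidable at that concurrency.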
Jan 5 2024
Jan 4 2024
Jan 3 2024
I suggest we standardize on the configuration that we've used for the golang applications using cassandra.
Yes, your understanding is correct; I had a patch fixing this that never got merged, I should just make a new version of that.
Dec 14 2023
Dec 11 2023
Dec 7 2023
Or not :)
Dec 6 2023
We intend to take a stab at this during next week's MediaWiki CodeJam.
Dec 5 2023
Almost anything relevant internally uses envoy to mediate TLS both client and server side, so it's probably useful to list the oddballs.
Raising the priority because at best Special:landingcheck is using data from a year and a half ago. Given we're entering our main fundraiser, this should probably be solved?
Dec 4 2023
@Ladsgroup I think the log linked by @TheresNoTime is a typical example of a distributed transaction going wrong:
Please note that a lead time is usually requested for throttle requests. Opening the task on a Friday for something needed on Monday (and adding the relevant information only on Saturday) isn't the right way to plan things.