Mar 25 2024
I don't think httpbb tests should outright break deployment; rather, when issues come up, they should ask the deployer to confirm whether they intend to continue.
Every time I log into the victorops portal I'm shown reCaptcha, and it's the only site on the internet that does, so it's not about the reputation of my IP/browser profile.
Mar 24 2024
In T295007#9655915, @Jeff_G wrote:In T295007#9655859, @Joe wrote:With the API change, you should be able to upload large files by URL (up to 4 GB IIRC) without incurring timeouts; and that will also allow us to move file processing to Shellbox, making our infrastructure more secure.
@Joe: How about 5 GiB per https://gerrit.wikimedia.org/r/1002813 ? See https://phabricator.wikimedia.org/T191804#9363066 for a discussion about the capacity implications of this change.
Mar 23 2024
In T295007#9534555, @Yann wrote:What's the status of this task?
Mar 21 2024
There are a few reasons why we didn't migrate changeprop to use the service mesh; first of all, we don't want timeouts to be defined outside of changeprop itself.
Mar 8 2024
@tstarling I think we determined that the expensive part of handling large files in shellbox was mostly the download/hash verification of the received file; I don't even think we strictly need PUT support.
Mar 7 2024
Mar 6 2024
Hi @Eevans, I'm a bit perplexed as to why you think serviceops should be able to assist with this issue. From the looks of it, this is an application bug triggered by external traffic.
Mar 5 2024
@Scott_French great job; indeed, I think we can just go forward and do it, even if there are a few things that look suboptimal.
Mar 4 2024
Re-uploaded the packages to the right components.
Mar 2 2024
In T249745#9592919, @aaron wrote:A couple things I wonder about:
- Though the bottleneck seems to be EventGate more than Kafka, I still wonder why profile::kafka::mirror::properties doesn't blacklist all MW jobs?
- Is anything making use of that extra data?
Mar 1 2024
Apologies, I must have made a mistake. I will fix it as there are other issues as well.
Feb 29 2024
When we're talking about errors, it's always a good idea to reason in terms of error ratios and not error rates.
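As a quick illustration of why (a sketch with made-up numbers, not real traffic figures):
<?php
// The same absolute error rate can be noise or an outage depending on volume:
// 100 errors/s is a 0.2% error ratio at 50,000 req/s, but 50% at 200 req/s.
$errorsPerSecond = 100;
foreach ([50000, 200] as $rps) {
    printf("%5d rps -> %.1f%% error ratio\n", $rps, 100 * $errorsPerSecond / $rps);
}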
In T358634#9586029, @tstarling wrote:In T358634#9582468, @Joe wrote:I don't think it holds any water for systems involved in live responses, or which have strict latency requirements in general.
For instance, enabling swap might save a database from being OOM-killed, but it will slow the database to a halt, basically making it unusable for serving live requests.
That's not true, according to the article I linked by Chris Down. If a database server is completely out of memory, it's better to swap than to go into a livelock, a state so severely broken that power-cycling is seen as a reasonable solution. But either way, the problem is that you're running out of memory, not that you have swap.
Under light memory pressure, enabling swap allows more RAM to be used for file caches, improving performance.
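For illustration, a common middle ground in this debate is to keep swap enabled but tune how eagerly the kernel swaps anonymous memory. vm.swappiness is the relevant kernel knob; the value below is purely illustrative, not a recommendation:
$ cat /proc/sys/vm/swappiness     # check the current value (the kernel default is usually 60)
$ sudo sysctl vm.swappiness=10    # keep swap as a safety valve, but prefer reclaiming page cache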
In T358636#9586057, @Scott_French wrote:If we're comfortable with that being the default for the entire keyspace (i.e., even for new workloads) then that's a fairly straightforward config change [0].
In the meantime, I remembered why we were only replicating the /conftool keyspace:
Feb 28 2024
In T358636#9582669, @Volans wrote:That etcdmirror is mirroring only the /conftool keys is totally news to me; I assumed it was replicating the whole content of etcd. But indeed it does not:
$ sudo etcdctl --endpoints https://conf1007.eqiad.wmnet:4001 ls /
/conftool
/spicerack
/testvs
$ sudo etcdctl --endpoints https://conf2005.codfw.wmnet:4001 ls /
/conftool
Is there any specific reason we shouldn't replicate it all? If we replicated it all, this issue should never happen, right?
Spicerack follows the SRV records for the endpoint, so if that gets failed over to codfw spicerack should follow and I would have expected to find the locks replicated there too.
I don't think this is a data persistence issue; it's much more probable that this is actually a restbase bug. At the very least, investigation should start there.
The file has been removed from the puppet doc tree, so at least that's not a problem anymore.
I also want to note that on kubernetes, memory is mostly managed by the k8s scheduler on top of the kernel's: we never let memory use overflow, because containers (which are nothing more than cgroups) get OOM-killed to enforce their limits. The k8s scheduler also takes care of never allocating the portion of memory we reserve for system components, which we currently do reserve.
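As a minimal sketch of how that is expressed in a pod spec (values illustrative, not our production settings):
resources:
  requests:
    memory: "512Mi"   # what the scheduler accounts for when placing the pod on a node
  limits:
    memory: "1Gi"     # exceeding this gets the container's cgroup OOM-killed by the kernel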
Focusing on the swap part of the problem, for posterity:
Feb 23 2024
Before we move into finding solutions, I'd like to better understand the goal we want to accomplish:
- If the goal is to make what we upload to packagist cleaner, composer.json has a way to do it (archive.exclude, IIRC? see the sketch after this list)
- If we prefer not to have pygmentize in the repository (why?), we can adapt the image build process to fetch it remotely
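A minimal sketch of the first option, assuming archive.exclude is indeed the right field (the paths are hypothetical):
{
    "archive": {
        "exclude": ["/pygments", "/tests"]
    }
}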
In T345274#9267674, @kostajh wrote:In T345274#9267646, @JMeybohm wrote:In T345274#9131624, @kostajh wrote:@Niharika @Tchanders any concerns with this?
Can we please get your feedback on this? It would help us decide whether to put additional work into keeping the similar-users chart up to date.
We are interested in using the service, but we can't commit to a specific timeline for when it would be connected to a MediaWiki extension interface: we're sorting out the roadmap and timelines for the Trust and Safety Product Team, and we have a lot of ongoing projects.
Is it a lot of effort to keep the chart up to date?
This task was about HHVM-specific issues. Feel free to reopen if you think it's still valid.
Feb 22 2024
In T357949#9565748, @tstarling wrote:In T357949#9565402, @bd808 wrote:The .pipeline files are used to create container images for running in Wikimedia production or anywhere else that may choose to run Shellbox as a container.
You're saying that anyone anywhere who wants to run Shellbox in a container, whether they are using MediaWiki or not, should use the exact same configuration as WMF production, down to the PHP version, number of FPM workers, list of fonts to install, etc.?
@tstarling just removing the .pipelinelib directory isn't an option if we want to keep using shellbox in production. I don't see a good solution for that other than maintaining separate branches and backporting changes from the main branch to a wmf branch, which would come with its own inconveniences, of course.
Feb 21 2024
We still need to investigate/fix the regression on CLI for DjVuImage, see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/720143/comments/8f625c32_80c745cd
Feb 20 2024
I built and uploaded the packages to apt, which means that next week we should automatically roll them out to kubernetes in the weekly rebuild of base images.
Feb 16 2024
In T357595#9544806, @RLazarus wrote:the intention was probably for this to match something a bit more restrictive (e.g., matching ^/wiki(/.*)?$)
I'm looking at the diff where the RewriteRule was introduced, and I'm pretty sure this is right -- it's replacing a
ProxyPass /wiki fcgi://127.0.0.1:9000<%= @docroot %>/w/index.php retry=0
so we went from a path argument to a regex. But some history from @Joe would help to confirm. And we'd still need to make sure nothing else has started relying on this in the last five years.
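For reference, an untested sketch of what the more restrictive match could look like, keeping the same fcgi target as the ProxyPass above (the [P] flag hands the rewritten URL to mod_proxy):
# Hypothetical: only proxy /wiki and paths under /wiki/, not every path starting with /wiki
RewriteRule ^/wiki(/.*)?$ fcgi://127.0.0.1:9000<%= @docroot %>/w/index.php [P]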
Feb 12 2024
@Tacsipacsi I would ask you to keep emotions in check, it's very hard to collaborate (which is what we are supposed to be doing here) when someone is that confrontational. Specifically:
- yelling doesn't get your point through better, quite the opposite. In fact, it took me quite a bit to understand what you were trying to say.
- raising tasks to UBN! because you're in disagreement also won't get your point through better.
Feb 8 2024
Gitlab needs regular maintenance windows, at least once a month if not more often, and they usually last around 15 minutes.
Feb 7 2024
My spot tests show transcodes now work on testwiki. I'll let @brion take a look as well before I declare this bug resolved.
Feb 6 2024
While the original issue is solved, we're running into a new one: now the script exits with an OOM, and unless I'm reading it incorrectly, the memory limit seems to be just 4 MB, which would explain the problem.
Hah, I found the issue.
That is not the problem, or not the only one for what it's worth; I've tested encoding an audio file and I get the same error, and there is no TMH_OPT_VIDEOCODEC set there (see reqId in Logstash).
I think the '\\''--env=TMH_OPT_VIDEOCODEC='\\''\\'\\'''\\''vp9'\\''\\'\\'''\\'''\\'' '\\'' looks quite suspicious here.
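For what it's worth, that pattern of multiplying '\'' sequences is the signature of a shell-quoted string being quoted again. A minimal reproduction, assuming PHP's escapeshellarg gets applied more than once somewhere in the call chain (a guess, not a confirmed diagnosis):
<?php
$arg = "--env=TMH_OPT_VIDEOCODEC='vp9'";
$once = escapeshellarg($arg);    // wraps in quotes; each ' becomes '\''
$twice = escapeshellarg($once);  // quoting the already-quoted string multiplies the '\'' sequences
echo $twice, "\n";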
Jan 31 2024
Given the chosen size is both non-standard (meaning it's not used on most large wikis) and not in the list of thumbnail sizes we pregenerate at upload time, I would imagine switching enwiki to this new thumbnail size would have a big impact on both the upload edge clusters and the backend object storage.
In T355292#9468360, @TheDJ wrote:
Low priority as we've found out the problem is fundamentally deeper than we'd like.
Jan 30 2024
Just stating for the record that connection refused/reset messages will come from our edge caching layer, specifically from the TCP stack of our servers there, so this wouldn't be related to a migration to kubernetes (which is still only partial, btw).
Jan 25 2024
It generally seems ok, but a few considerations:
- kafka-main is much smaller than kafka-jumbo, and critical to site operations
- The codfw.mediawiki.cirrussearch.page_rerender.v1 topic is pretty large at the moment, 292 GB in codfw and 149 GB in eqiad, while the corresponding eqiad topic is, as expected, tiny/irrelevant.
Jan 23 2024
In T355619#9479984, @hashar wrote:@Paladox is my de facto point of contact for patches I write for Gerrit. He is quite speedy and productive, and provides valuable reviews.
Please let's make apt-merge less cumbersome to use.
Jan 18 2024
Adding @brion as the resident expert / maintainer of TimedMediaHandler. I'd like to get your opinion on how hard it would be to port WebVideoTranscodeJob to use shellbox :)
In T355243#9467924, @kostajh wrote:@Joe do you want us to stop the script for now, and switch to not using the job queue?
In T355243#9467881, @kostajh wrote:If T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis is the cause, then I am unsure if this is an unbreak now, as that code has been running since January 5 (see https://grafana.wikimedia.org/d/STSXVVdSk/mediamoderation-photodna-stats?orgId=1&refresh=5m&from=now-90d&to=now)
EDIT: It looks like the mediamoderation script actually spawns jobs; interestingly, all errors seem to come from the same reqId, 8e1438850af4ec4c4b82ebb6, and clearly they're not ThumbnailRenderer jobs.
Jan 16 2024
Jan 11 2024
Jan 8 2024
It seems to me that trying to respond to 1k rps with a concurrency of 2 is probably the issue. Throttling is bad because it raises latencies; if any measure we take to avoid throttling increases latencies compared to throttling, why bother?
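Back-of-the-envelope, to make the mismatch concrete (a sketch, not a measurement):
<?php
// With 2 in-flight slots serving 1000 req/s, each slot must handle 500 req/s,
// i.e. finish every request within 2 ms, or queueing (throttling) is inevitable.
$rps = 1000;
$concurrency = 2;
printf("max per-request latency to keep up: %.1f ms\n", 1000 * $concurrency / $rps);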
Jan 5 2024
In T350507#9437424, @Jgiannelos wrote:The snippet from the cassandra-http-gateway helm chart is not using keyspace/tables (because it's user-submitted at runtime, from what I understand of the project). I updated my patch with this expected config structure:
cassandra:
  hosts: ["127.0.0.1"]
  port: 9042
  local_dc: "datacenter1"
  authentication:
    username: "cassandra"
    password: "cassandra"
caching:
  enabled: false
  cassandra:
    keyspace: "tests"
    storageTable: "storage"
Jan 4 2024
Jan 3 2024
I suggest we standardize on the configuration that we've used for the golang applications using cassandra.
Yes, your understanding is correct; I had a patch fixing this that never got merged, I should just make a new version of that.
Dec 14 2023
Dec 11 2023
In T348284#9395386, @RLazarus wrote:Yeah, good point. Fortunately it looks like a pretty straightforward Go patch to add a --namespace flag if need be, and pipe it through to the API call.
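A hypothetical sketch of the shape of such a patch (names invented for illustration; the real tool and API call aren't shown in this task):
package main

import (
	"flag"
	"fmt"
)

func main() {
	// The suggested --namespace flag, piped through to the API call.
	namespace := flag.String("namespace", "default", "namespace to scope the request to")
	flag.Parse()
	// ... pass *namespace along when building the API request ...
	fmt.Printf("querying namespace %q\n", *namespace)
}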
Dec 7 2023
In T352650#9388845, @VirginiaPoundstone wrote:@Joe thanks for thinking this through. I have three follow-up questions:
Or not :)
Dec 6 2023
We intend to take a stab at this during next week's MediaWiki CodeJam.
Dec 5 2023
Almost everything relevant internally uses envoy to mediate TLS on both the client and server side, so it's probably useful to list the oddballs.
Raising the priority because, at best, Special:LandingCheck is using data from a year and a half ago. Given we're entering our main fundraiser, this should probably be solved?
Dec 4 2023
@Ladsgroup I think the log linked by @TheresNoTime is a typical example of a distributed transaction going wrong:
In T352569#9377876, @Base wrote:Please next time allow the designated time of at least two weeks
I wonder in what fraction of situations an organiser would know the IP 2 weeks in advance. I would be surprised if it were even 30%.
In T352569#9377673, @Kizule wrote:@Joe I'm guessing that comment is to the author of this task. Are you going to schedule a patch for the backport, or do you want me to take care of it for you? :)
Please note that a lead time is usually requested for throttle requests. Opening the task on a Friday for something needed on Monday (and only adding the relevant information on Saturday) isn't the right way to plan things.