Thu, Nov 26
After reviewing some more details:
www-data is already present, but its home directory doesn't exist, so there is no use in re-using it. We're also *not* actively relying on it, since we're setting the home directory to /srv/app anyway. So we're ok with creating a duplicate user in this case, even if it feels weird.
I did set the priority to 'high' because this is a blocker to the production deployment of shellbox.
Wed, Nov 25
Tue, Nov 24
ok, found the problem, and it's slightly embarrassing:
Mon, Nov 23
Tue, Nov 17
the apache change has been merged and tested against the renewed httpbb test suite. It will be deployed everywhere within the next 30 minutes. Please confirm this resolves the issue.
I didn't notice the task had been reopened; I will decline it again, as I don't see a need for a node 14 package to use in production. CI can easily use the binaries from nodejs.org, as I stated above.
While nothing above changes my view that we don't have a compelling reason to install node 14 packages for production right now, I want to make a point: developers don't just decide to use a newer version of some software in production in a vacuum.
Mon, Nov 16
Not sure what the rationale for this task is.
Fri, Nov 13
From the discussion on VP:T I would assume the editor just hit the maximum number of templates that can be included in a single article. Lowering the priority and removing Traffic from the task, and adding what I think is the actual problem.
Thu, Nov 12
I strongly doubt the problem happens at the traffic layer. This seems to be a different kind of problem - maybe those pages, once edited, overflow some specific limit.
The current situation is:
After some digging around Special:Newfiles searching for a pattern in the files that fail, I think I found a smoking gun:
Not many people are around, and most importantly no one with extensive WanCache experience. @ArielGlenn and I are thinking of merging the patch and testing it on a debug server to see if it fixes the issue for files where we see it.
@AMooney I don't think that task will be fast to complete, also because we should really dedicate our energies to transitioning MediaWiki to Kubernetes and freeze all the unrelated work for now. Adding the env variable to the jobrunners is a matter of one simple patch, and I'll take that on.
Tue, Nov 10
Gentle nudge, this really needs to be completed.
Mon, Nov 9
Proposals #4 and #5 would directly clash with stuff we do in production, and/or create confusion as to which services are for cloud and which aren't, and I strongly oppose them.
Fri, Nov 6
For the ICU transition it's crucially important that no machine with write access to the databases gets updated before the date of the migration. So please do not test this on the eqiad mwdebugs for the time being.
Thu, Nov 5
As for the manifests, if we need them, they should be in helmfile.d/admin I guess?
Wed, Nov 4
I think that while we should try to avoid such a situation, mandating by policy that we either roll forward or roll back would just remove the ability of the people managing releases to make a judgement call, which is almost never a good idea. And that's not counting that, in some cases, rolling back after days might be slightly problematic.
calico/node is the only more-than-slightly-worrisome thing here. For everything else we're probably ok using their builds for the time being. It's also true that, as far as external images go, a Red Hat certified image is probably one of the most "secure" options we can find:
Tue, Nov 3
Mon, Nov 2
I typically prefer that we rebuild images from Dockerfiles, using our base images. That gives us a tad more control over upgrading in case of a disastrous security hole in e.g. Alpine Linux.
Oct 30 2020
Oct 29 2020
As far as mc2029 is concerned, you can just proceed without any impact.
Oct 28 2020
Oct 23 2020
I added what I think is an outline of the work we still need to do in order to make this process safe and efficient in T243009#6574045
Sorry, I see that given I didn't express my thoughts linearly, some confusion ensued.
Oct 21 2020
Also I want to clarify: we can reduce the pain as much as possible, but for the duration of the transition phase it will be somewhat more work than we're used to for this kind of change. There is no way around that that I can think of.
Regarding the apache httpd container, I am approaching layering as follows:
Oct 20 2020
Cassandra is not without its own issues, and it has a much higher cost per GB than parsercache currently has (I did no research, but I suspect it's on the order of 5x at the very least).
Oct 19 2020
@dduvall I like the idea of using scap prep to extract the code from the images, I didn't think of inverting the logic like that but it's surely workable.
Additional datapoint that was required: we should be sending ~10-15k messages per second to the central log server, depending on traffic.
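Rough back-of-envelope for what that rate means in daily volume (the ~500 bytes average message size below is an assumption of mine, not a measurement):

```python
# Back-of-envelope only: the 10-15k msg/s figures come from the comment above,
# the average message size is assumed, not measured.
AVG_MSG_BYTES = 500
SECONDS_PER_DAY = 86_400

for rate in (10_000, 15_000):
    msgs_per_day = rate * SECONDS_PER_DAY
    gib_per_day = msgs_per_day * AVG_MSG_BYTES / 2**30
    print(f"{rate:>6} msg/s -> {msgs_per_day / 1e6:.0f}M msg/day, ~{gib_per_day:.0f} GiB/day")
```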
I think there are just a few dangling services that are managed by the analytics team, specifically:
While those URLs weren't part of my original report (which was about truncated URLs), it seems the behaviour has changed in the meantime, but not in a correct way.
Oct 16 2020
@srodlund the post is ready for a thorough review :) I shared the document with you, let me know what you think :)
Oct 15 2020
Oct 13 2020
This needs to be done while we have one DC turned off for most traffic, as we do right now, IMHO.
Oct 12 2020
I just want to comment that this server had its root filesystem fill up today, and it's in a strange state where only 13 GB are found by du -xsh /, but 53 GB are occupied on /dev/md0 according to df. Given there are no huge deleted files I can see, it seems possible the server has some leftover data under /srv on the root partition that is now hidden by the mountpoints.
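As a sanity check for that theory, a small sketch (nothing server-specific, just comparing what statvfs/df sees with a du -x style walk of the root filesystem; a large gap points at deleted-but-open files or data shadowed by mountpoints):

```python
import os

def same_dev(path, dev):
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False

root = "/"
st = os.statvfs(root)
fs_used = (st.f_blocks - st.f_bfree) * st.f_frsize   # roughly what df reports as "Used"

root_dev = os.lstat(root).st_dev
walked = 0
for dirpath, dirnames, filenames in os.walk(root):
    # stay on the root filesystem, like `du -x`
    dirnames[:] = [d for d in dirnames if same_dev(os.path.join(dirpath, d), root_dev)]
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            s = os.lstat(path)
        except OSError:
            continue
        if s.st_dev == root_dev:
            walked += s.st_blocks * 512   # allocated bytes, like du

print(f"statvfs (df) used : {fs_used / 2**30:.1f} GiB")
print(f"walk (du -x) total: {walked / 2**30:.1f} GiB")
print(f"unaccounted       : {(fs_used - walked) / 2**30:.1f} GiB")
```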
The situation of the memcached servers is amazingly telling of how unmanaged deployment-prep is; we should really dedicate some resources to it, but also work better when we do stuff there.
Oct 9 2020
Adding some notes after yesterday's meeting:
I think some form of ratelimiting for that should be present in restbase, and in general, we should ratelimit calls to uncachable URLs to volumes we can support.
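To make the "volumes we can support" part concrete, a minimal token-bucket sketch in Python (illustrative only, not restbase code; the rate and burst numbers are made up):

```python
import time

class TokenBucket:
    """Allow a sustained request rate with a bounded burst; anything beyond gets rejected."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s          # requests/second we can actually support
        self.capacity = burst           # short bursts we're willing to absorb
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical usage: one bucket per (client, uncachable endpoint) pair.
bucket = TokenBucket(rate_per_s=10, burst=20)
if not bucket.allow():
    print("would answer 429 Too Many Requests")
```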
Oct 8 2020
Oct 7 2020
To clarify a bit: restbase has hourly spikes of requests for the feed endpoint; those go back to wikifeeds, which in turn calls both restbase and the action api.
Oct 6 2020
Oct 5 2020
Oct 2 2020
Small status update:
Checking in to report that calls from OKAPI have stopped tonight. Thanks @RBrounley_WMF (and the team)!