I would object to using any build step that depends on downloading assets from the internet at runtime. We do that at the moment for many projects and we're aware it's completely wrong and needs fixing.
Thu, Apr 29
There is definitely something going very wrong with memcached:
Given we only make requests to external storage when parsercache has a miss, it seemed sensible to look for corresponding patterns in parsercache.
Wed, Apr 28
Tue, Apr 27
To be clear, the idea came out of the fact that during read-only time we had a lot of jobs failing, but given we do retry the jobs, we should not actually need it.
Mon, Apr 26
Wed, Apr 21
Mon, Apr 19
Fri, Apr 16
Thu, Apr 15
Thanks @Papaul! We'll now work on service implementation.
Wed, Apr 14
joe@wotan:~/Sandbox/mw-on-k8s$ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
mediawiki-test-6fb67b5f8b-2nwqh   6/6     Running   5          3m44s
Tue, Apr 13
I don't realistically see it as possible to switch memcached to TLS in the time remaining before we need to renew the certificates, hence I'm raising the priority. It will be raised to UBN! in a couple of days.
Thu, Apr 8
Mar 24 2021
We've had all supporting services serving from codfw for the well-announced rebuild of the eqiad kubernetes cluster (see email to wikitech-l). Numbers should be unaffected again once we migrate back.
Mar 23 2021
Trying to break down my current thoughts:
At 15 workers per pod, we get 5 pods per node (6 if we only reserve 5% of ram and cpu). That's more or less the maximum concurrency at which the sweet spot holds for php-fpm. It gets us either 75 or 90 workers per node, and I think it would be a net win. I will update the task once I have more realistic numbers.
A typical appserver has 96 GB of memory and 48 cores. Let's assume we can use up to 85% of those with pods, which looks a bit conservative, but it's ok for our current calculations.
Some data from one appserver (a back-of-envelope check follows this list):
- httpd uses less than 1 GB of memory and 1 CPU. If we assume we'll reduce the number of workers, it should be safe to assume e.g. 600 MB and 0.6 CPUs are ok
- mcrouter uses around 300 MB of memory. Again, this would be reduced if it's inside the pod, so ~ 200 MB should be safe. 1 CPU is enough for a whole-host mcrouter, so we can assume 0.5 CPUs should be enough
- nutcracker currently uses 200 MB of memory + 0.1 CPUs
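Putting these figures together, here is a rough sanity check of the sizing above (a minimal Python sketch; the node size, the 85% usable fraction, the per-container overheads, 15 workers per pod and 5 pods per node are the values quoted in these comments, everything else is derived from them):

```python
# Back-of-envelope check of the pod-sizing numbers above.
# All inputs are the rough figures quoted in this thread, not measured limits.
NODE_CPUS = 48
NODE_MEM_GB = 96
USABLE_FRACTION = 0.85          # assume ~85% of the node can go to pods

# Per-pod sidecar overhead (httpd + mcrouter + nutcracker)
SIDECAR_CPUS = 0.6 + 0.5 + 0.1
SIDECAR_MEM_GB = 0.6 + 0.2 + 0.2

PODS_PER_NODE = 5
WORKERS_PER_POD = 15

pod_cpus = NODE_CPUS * USABLE_FRACTION / PODS_PER_NODE
pod_mem = NODE_MEM_GB * USABLE_FRACTION / PODS_PER_NODE

# What is left for php-fpm once the sidecars are accounted for
fpm_cpus = pod_cpus - SIDECAR_CPUS
fpm_mem = pod_mem - SIDECAR_MEM_GB

print(f"per pod: {pod_cpus:.2f} CPUs, {pod_mem:.2f} GB")
print(f"per php-fpm worker: {fpm_cpus / WORKERS_PER_POD:.2f} CPUs, "
      f"{fpm_mem / WORKERS_PER_POD:.2f} GB")
print(f"php-fpm workers per node: {PODS_PER_NODE * WORKERS_PER_POD}")
```

With these inputs each pod gets roughly 8 CPUs and 16 GB, which leaves about 0.46 CPUs and 1 GB per php-fpm worker and 75 workers per node, matching the estimate above.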
Mar 22 2021
Given it has created some doubts, let me clarify: I've created a first version of the charts that implements solution 1 (and not a complete version of it, either).
Mar 18 2021
As far as mcrouter goes, the only non-brittle solution is to run it inside the pod, so solution 1. The reason is simple: if mcrouter runs on the node or in a daemonset, a restart or a crash would make it unavailable for all the pods on the node, without MediaWiki *or* kubernetes noticing.
Triaging as high priority: at best this is going to make building the images fail, at worst it's a security liability.
Mar 17 2021
Using hostnames in mediawiki-config is not really an option.
FWIW, the document on wikitech is not authoritative - service::catalog in hiera is.
FWIW, ServiceOps decided against using full mesh networking for our services because we considered istio to be both very complex and not really needed for our level of complexity.
Mar 16 2021
The best practices I am talking about are, basically:
After some more work, these are my ideas for liveness and readiness probes:
Mar 15 2021
Ok, I'll try to re-summarize my argument:
the problem we're trying to solve is having transactional consistency between mediawiki and kafka. And we want to do it not at the application layer, but at the data layer, which I think is wrong for a few reasons. But before we go back to discussing solutions, I'd like to see a better explanation of the problem.
Reading the whole history here, it seems that the problem we want to solve is a traditionally unsolvable one (keeping two logically-distinct datastores perfectly consistent in a distributed architecture while not being in a CP setup).
Mar 11 2021
Mar 10 2021
Hi @valerio.bozzolan, from this tweet
Mar 9 2021
Ideally, the liveness probe needs to check if the container is running (more or less), while the readiness probe should check that the service is still responding.
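To illustrate that split only (this is not how the mediawiki chart implements its probes; the endpoint names, the port, and the backend check are invented for the example), a liveness handler can answer as long as the process can serve anything at all, while the readiness handler also verifies that a dependency is reachable:

```python
# Illustrative sketch of separate liveness and readiness endpoints.
# Endpoint names, port and the "backend" dependency are hypothetical.
import http.server
import socket

def backend_is_reachable(host="localhost", port=9000, timeout=0.5):
    """Readiness-style check: can we still talk to a (hypothetical) dependency?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class ProbeHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: if we can answer at all, the process is alive.
            self.send_response(200)
        elif self.path == "/ready":
            # Readiness: only report ready if we can actually do useful work.
            self.send_response(200 if backend_is_reachable() else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), ProbeHandler).serve_forever()
```

The point being that kubernetes restarts a container on a failing liveness probe, while a failing readiness probe only takes it out of rotation.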
Mar 5 2021
For the record, we're now building the actual multiversion images of mediawiki; it would be interesting to do all testing using those. In particular, it would be worthwhile imho to work on the layering so that we reduce the number of layers we need to download for each release.
systemd-memcached-wrapper is a perl script, an evolution of the old wrapper script debian always used, which caused me more headaches than it solved. I'd very much prefer we keep the approach we took with our systemd unit back in the day (while it might make sense to switch to running as the memcached user for the reasons above).
Mar 4 2021
Mar 3 2021
Can I ask how we intend to perform the transition from non-TLS to TLS in detail? I see a series of pitfalls with our current setup and the code I see in puppet, but please be explicit about the steps you want to take to switch one server to TLS.
Indeed you should enable the mwapi-async listener and then:
- change MEDIAWIKI_API_URL to http://localhost:6500/w/api.php
- send the Host header with the actual wiki host you want to reach (see the example request below)
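For example, a request through the local listener could look like this (a minimal Python sketch; the wiki hostname and the API parameters are placeholders, only the listener URL comes from the steps above):

```python
# Sketch: call the MediaWiki API through the local mwapi-async listener,
# selecting the target wiki via the Host header rather than the URL.
import requests

MEDIAWIKI_API_URL = "http://localhost:6500/w/api.php"

resp = requests.get(
    MEDIAWIKI_API_URL,
    params={"action": "query", "meta": "siteinfo", "format": "json"},
    headers={"Host": "en.wikipedia.org"},  # placeholder: the wiki you actually want
    timeout=5,
)
print(resp.json())
```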
After more analysis, this is my understanding of the outstanding problems:
- pontoon was connecting without a password as logmsgbot, and given the nick does not have enforce on, it will just lie around, causing tcpircbot to be unable to connect
- this also caused problems with freenode's spam protection so even after ghosting the user, we had issues coming from that
- any attempt to !log while the bot is reconnecting (it can take up to 2 minutes) will crash the bot.
We found a few other issues:
- The nick has no enforce, thus a random instance running in labs is connecting (obviously without a password)
- Nickserv says it last saw the user when the issues started:
NickServ (NickServ@services.): Last seen : Mar 03 00:10:46 2021 (7h 45m 43s ago)
NickServ (NickServ@services.): User seen : Mar 03 07:47:30 2021 (8m 59s ago) [this is me ghosting the user]
The issue is more general: tcpircbot crashes on every invocation in the following way:
Mar 2 2021
This task has definitely nothing to do with serviceops specifically.
Mar 1 2021
At the meeting we decided it's ok to let apache log to kafka as a main method of collection. We will therefore, at least in a first iteration:
@RLazarus in https://phabricator.wikimedia.org/T248093#6076630 you mentioned committing a script for automating cert renewal, and I see it indeed. Renewing the certs should amount to just running the script, correct?
I can't imagine a single valid reason why a distro upgrade would make data transfers slow down so much.
Feb 25 2021
I got a few reports of bots no longer having issues, so I would consider the immediate problem solved.
Further update: in the 10 minutes after the dust of resharding settled, we had just 36 errors, which seems more than acceptable.
After merging my change, the number of errors in OAuth.log regarding 'nonce already used' decreased from ~ 80/minute to ~ 17/minute, which seems to be in line with the rate we had before the incident.
I agree that it would make sense for anyone with global root to also be able to manage CI, but it was a deliberate choice back in the day AIUI to exclude global roots.
Feb 24 2021
No, that entry is for testreduce, which is another test instance too. So I doubt that what you're seeing in the logs has anything to do with this setting.
Just for the record, the restbase cluster that has ipv6_compat activated is the dev cluster. Nothing serving production traffic.