The problem we're trying to solve here is not an individual cache key read, or even multiple cache key reads, but more generally smoothing any read swarm we have. For now this is an experiment, and based on our past experience I don't think that caching keys with a TTL lower than the DB max replication lag could really cause consistency issues. I will go further: if this could cause consistency issues, then we're using memcached as a database with some guarantees of consistency and durability, neither of which is true, and we need to go back and question that.
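To make the intent concrete, here is a minimal sketch of that read-through pattern, assuming pymemcache and a hypothetical db_fetch() helper; the 10-second TTL is purely illustrative, the only constraint being that it stays below the DB max replication lag:

```python
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def read_smoothed(key, db_fetch, ttl=10):
    """Read-through with a short TTL: the swarm hits memcached, and the
    database sees roughly one read per key per TTL window."""
    value = cache.get(key)
    if value is None:
        value = db_fetch(key)              # may read from a lagged replica
        cache.set(key, value, expire=ttl)  # ttl kept below the max replication lag
    return value
```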
I have one fundamental question: I see you're using AWS for building the project. Do you plan to just run a prototype on AWS, or is it intended to be used in production as well?
Resolving this, as we no longer have services with weight 0, and "pool" now correctly refuses to pool a service if its weight is zero.
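For reference, the guard amounts to something like the following (a minimal sketch, not the actual conftool code; field names are illustrative):

```python
def pool(service):
    # Refuse to pool a service whose weight is zero.
    if service.get("weight", 0) == 0:
        raise ValueError("refusing to pool %s: weight is 0" % service.get("name", "<unknown>"))
    service["pooled"] = "yes"
    return service
```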
Wed, Jun 3
@Krinkle I think it's perfectly ok not to use changeprop - I just wanted the reasoning for why to be in the RFC, so that that analysis is explicit and documented. I have no concerns regarding the RFC as it is.
My tests went fine:
- mwdebug* servers got the datacenter-appropriate pool-$dc-testserver user
- the deployment servers got the conftool user instead
We're moving to purged, which consumes purges from Kafka, so we will set up alerting based on that rather than on htcpd.
Thu, May 28
This happens because the prometheus role somehow includes conftool::scripts but not profile::conftool::client.
The deploy strategy is simply to add the new users to etcd, move most hosts to use conftool as the root user immediately, and then progressively migrate them to the new users system in the long run.
Picking this up again - we already migrated the CDN to use https - do we need to do something for CI?
Just out of curiosity, what's the problem with running scripts from the deployment host, other than "we prefer if they're run on mwmaint for consistency"?
Wed, May 27
Setting priority to "high" as the failed disk was also used in a JBOD configuration for cassandra, which is now failing to start.
Mon, May 25
The package has been uploaded.
As of today, all appservers use envoy too.
I've been monitoring the status of new images in the following way:
Fri, May 22
Thu, May 21
I think I will try to implement the following RBAC schema:
Wed, May 20
Status update: we've deployed envoy on all mediawiki servers with the exception of:
- jobrunners (where we still have to reproduce what nginx was doing)
- all servers in the appserver cluster in eqiad with a sequence number above mw1275.
Looking at kafka, it seems there is a bizarre pattern in how the data is produced to the "netflow" topic:
So, while I find the idea of using poolcounter to limit editing concurrency (which is not rate-limiting; that's a different thing - see the sketch below) a good proposal, and in general something desirable to have (including the possibility of tuning it down to zero if we're in a crisis, for instance), I think the fundamental problem reported here is that WDQS can't ingest the updates fast enough.
- "The above suggests that the current rate limit is too high," this is not correct, the problem is that there is no rate limit for bots at all. The group explicitly doesn't have a rate limit. Adding such ratelimit was tried and caused lots of issues (even with a pretty high number).
Tue, May 19
I think we're at the point where it would be best if we could change the logic of our testing, and use docker directly, so that we can split tests into different images.
Mon, May 18
Status update: purged is now consuming purges from restbase directly via kafka and not via multicast anymore. This should unblock the complete migration of changeprop to kubernetes, amongst other things.
Thu, May 14
Wed, May 13
I took a brief peek at what flows from php-fpm to systemd over dbus:
I just realized the problem is the "buTster" typo I made in the commit. So I'm changing the priority accordingly, and I'll rewrite the bug description since the problem is different :)
Setting priority as "high" as this is blocking a project.
FWIW, I seem to remember systemctl status php7.2-fpm stalling on a busy server, but I might remember incorrectly.
Tue, May 12
This change was released to production to all wikis yesterday.
Mon, May 11
We ran this test, and it passed with flying colors:
- A transient peak of memcached errors, lasting less than 1 minute
- The gutter pool picks up the slack pretty fast
- No noticeable effect on latency.
- The cache hit ratio on the gutter pool was good (88% after less than one hour in the pool, though probably capped around that value by the 10-minute TTL)
- As soon as the server became available again, the memcached traffic moved back quickly but not instantly, over the span of ~ 2 minutes. This also reduces the risk of thundering herds from the deletes that get replayed.
Fri, May 8
I don't think we need the request_id to be preserved - purged is definitely not the place to do analysis of such data.
May 6 2020
@elukey shall we schedule this test for 6:00Z on Monday, May 11th?
While it's clear that 400 alerts flooding production are not great, this check is important for every single machine. So we can aggregate the output, but we can't suppress it. We need to know *very clearly* if even one single machine is running an outdated version of mediawiki.
So I second the aggregation, provided it's possible to show clearly in the icinga alert which machine (or machines, if their number is below, say, 90% of all mw servers) is failing the check.
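To illustrate what I mean, a rough sketch of the aggregated output (not a real icinga check; the 90% threshold is the one suggested above and everything else is a placeholder):

```python
def summarize(results, threshold=0.9):
    """results: dict mapping host name -> True if it runs the expected MediaWiki version."""
    failing = sorted(host for host, ok in results.items() if not ok)
    if not failing:
        return "OK: all %d mw servers run the expected version" % len(results)
    if len(failing) < threshold * len(results):
        return "CRITICAL: outdated MediaWiki on: " + ", ".join(failing)
    return "CRITICAL: outdated MediaWiki on %d/%d mw servers" % (len(failing), len(results))
```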
Apr 30 2020
After a discussion on the patch, it became clear to me that some information can't be removed from the message, which makes resource_change the perfect fit for our use-case.
Looking at our existing event schemas, resource_change has all the information we need, but also much more. We would like to get a much smaller object to transmit, and specifically we only want to define:
In another case, we had ~ 100 errors corresponding to a spike in latency from the backend:
Apr 29 2020
At a later time, we could think of changing the logic to make purges avoid race conditions, removing the need for rebound purges.
One way to implement this would be the following:
- No more changes are needed at the application layer
- All purged servers join a single consumer group per datacenter. This will ensure each purge message is consumed by only one purged instance.
- This instance will take care of sending the purges to all the cache backends in the DC first, and to all the frontends afterwards (sketched below).
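A rough sketch of that scheme, assuming kafka-python and the requests library (topic name, group id and host lists are placeholders, and real purge messages would of course carry more than a bare path):

```python
from kafka import KafkaConsumer
import requests

BACKENDS = ["cp-backend-1:3128", "cp-backend-2:3128"]    # placeholder hosts
FRONTENDS = ["cp-frontend-1:80", "cp-frontend-2:80"]     # placeholder hosts

consumer = KafkaConsumer(
    "resource-purge",                      # placeholder topic name
    group_id="purged-eqiad",               # one consumer group per datacenter
    bootstrap_servers=["kafka1001:9092"],  # placeholder broker
)

for message in consumer:
    path = message.value.decode("utf-8")
    # Backends first, then frontends, so a frontend miss cannot repopulate
    # from a backend that still holds the stale object.
    for host in BACKENDS + FRONTENDS:
        requests.request("PURGE", "http://%s%s" % (host, path), timeout=5)
```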
Since purged is now in production, and we have some ongoing work that will reduce the amount of purges we send (T250261), I think it's time to revisit the idea of moving purges to Kafka. This would also help with the transition of change-prop to kubernetes.
I think we should run 3 different tests, and I would run them on 1 host first (a rough sketch of the packet-drop rules follows the list):
- Stop memcached completely
- Drop all packets directed to port 11211
- Drop a percentage of incoming and outgoing packets
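For tests 2 and 3, a rough sketch of the packet-drop rules with iptables, wrapped in Python only for orchestration (the 10% probability is illustrative, and each test would be applied and reverted separately):

```python
import subprocess

def iptables(*rule):
    # Assumes root; rules added with -A would later be removed with -D.
    subprocess.run(["iptables", *rule], check=True)

# Test 2: drop all packets directed to memcached's port.
iptables("-A", "INPUT", "-p", "tcp", "--dport", "11211", "-j", "DROP")

# Test 3 (run separately): drop ~10% of incoming and outgoing memcached packets.
iptables("-A", "INPUT", "-p", "tcp", "--dport", "11211",
         "-m", "statistic", "--mode", "random", "--probability", "0.10", "-j", "DROP")
iptables("-A", "OUTPUT", "-p", "tcp", "--sport", "11211",
         "-m", "statistic", "--mode", "random", "--probability", "0.10", "-j", "DROP")
```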
A general observation first: it's not clear which application will be responsible for storing subscription data. I would assume, if we expect multiple possible sources of subscriptions, that those sources would keep track of their own subscriptions. But I can see arguments in the other direction - for example, maintaining those in a centralized place would make it easier for people to manage them. Anyway, this should be clarified in the RFC.
I have a few questions and observations about this RFC, but let's start from the basics:
Apr 28 2020
Some more data:
Just to be clearer: we achieved a much larger improvement in the average latency of requests by switching to persistent connections to our session storage:
First, the results of the real traffic test. These are averages over 10 minutes, starting after 20 minutes of having both servers pooled. This is an attempt at smoothing out the effects of very slow queries at the higher percentiles, which can be traffic-dependent.
Assuming we'll be ok with restarting php-fpm at every release, I reduced the amount of interned-strings memory and opcache allocated on mw1407 from the values in the puppet patch. I am now using 300 MB of interned strings cache and 3.3 GB of opcache space. These figures can probably be reduced further.
Apr 27 2020
After running a few benchmarks on mw1407 (where LCStoreStaticArray is used) vs mw1409 (which uses cdb files), it seemed the change made little to no difference for the following urls:
Apr 20 2020
Apr 17 2020
Apr 15 2020
I would frankly prefer to pass a flag to getCdnUrls, and return those dependent urls only if the flag has its default value. I say this because it won't make Title significantly heavier, and at the same time it will *quickly* fix a problem we have in production.
Apr 14 2020
This is now resolved; I've seen no further errors since my latest change was merged.
I've added some further retry logic for requests to parsoid; this *might* help.
Changing priority as this seems to be highly user visible.
Ehm, phab UI fail.