Having confirmed that pybal is not the issue here, and that databases might play a part in it, I'll change the tags accordingly.
While we have to wait and see whether the absence of php7 traffic improves the situation (and, if it does, why), I've noticed a few things about the latest occurrences of these instabilities:
- The fetch times reported by pybal correspond to what the appservers report
- Looking at the logs, the effect seems much more prominent in the last two days (starting on May 20th): on a couple of random application servers, the number of timeouts (i.e. requests taking more than 5 seconds to complete) more than doubled compared to preceding periods
- Again looking at the apache logs, the slow requests around the time of pybal's detection show the problem is concentrated on enwiki. All of the following data comes from a single appserver:
- enwiki constitutes 35% of all requests in the 10 seconds around pybal's detection; of those, 22% took 5 seconds or longer to complete
- In the same time period, only 0.3% of all requests not going to enwiki took longer than 5 seconds
- Verifying across the fleet with some cumin sorcery, it seems almost all requests taking more than 5 seconds at the time were on enwiki (a sketch of this kind of breakdown follows below)
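For reference, a minimal sketch of this kind of breakdown over a local copy of the access log. The file name and field positions are assumptions (the real log format on the appservers, and the cumin invocation wrapping this, are not reproduced here):

```
<?php
// Hypothetical sketch: reproduce the per-wiki slow-request breakdown from a
// copy of the apache access log. Field positions are assumptions (here the
// Host header is the 2nd field and the request duration in seconds is the
// last one); adjust them to the real log format.
$threshold   = 5.0; // "slow" cutoff, in seconds
$total       = 0;
$slow        = 0;
$enwikiTotal = 0;
$enwikiSlow  = 0;

$fh = fopen( 'access.log', 'r' );
while ( ( $line = fgets( $fh ) ) !== false ) {
    $fields   = preg_split( '/\s+/', trim( $line ) );
    $host     = $fields[1] ?? '';
    $duration = (float)end( $fields );
    $isEnwiki = ( $host === 'en.wikipedia.org' );

    $total++;
    $enwikiTotal += $isEnwiki ? 1 : 0;
    if ( $duration >= $threshold ) {
        $slow++;
        $enwikiSlow += $isEnwiki ? 1 : 0;
    }
}
fclose( $fh );

printf( "enwiki share of all requests: %.1f%%\n", 100 * $enwikiTotal / max( 1, $total ) );
printf( "slow enwiki requests: %.1f%%\n", 100 * $enwikiSlow / max( 1, $enwikiTotal ) );
printf( "slow non-enwiki requests: %.1f%%\n",
    100 * ( $slow - $enwikiSlow ) / max( 1, $total - $enwikiTotal ) );
```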
Switching off php7 confirmed it's the cause of the increased number of GETs.
So, after turning off php7 this morning, we saw no change in the rate of requests to mc1033.
@kostajh for now I'm switching off php7 for other investigations, so we will know immediately if the additional traffic is due to that or not.
While there is no evidence that the increase in traffic sent to php7 is the cause of this increase in errors, there are a couple of tasks we need to investigate further that could point at some form of resource starvation due to the coexistence of php7 and HHVM: see T223310 and T223647. So my first order of business will be to turn the php7 sampling back to zero for a couple of days, to see if that changes the status quo for pybal's checks.
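For context, "sampling" here means sending a configurable fraction of requests to php7 instead of HHVM. Below is a minimal, hypothetical sketch of such a percentage-based switch; the variable and function names are made up for illustration and are not the actual wmf-config settings:

```
<?php
// Hypothetical illustration of percentage-based engine sampling; the names
// $wmgPhp7SamplingRatio and shouldUsePhp7() are invented for this sketch.
$wmgPhp7SamplingRatio = 0.0; // 0.0 = all traffic stays on HHVM

/**
 * Decide whether a request should be served by php7.
 * Hashing the client IP keeps the choice sticky per client.
 */
function shouldUsePhp7( float $ratio, string $clientIp ): bool {
    if ( $ratio <= 0.0 ) {
        return false;
    }
    // Map the client to a stable bucket in [0, 1).
    $bucket = ( abs( crc32( $clientIp ) ) % 1000 ) / 1000;
    return $bucket < $ratio;
}
```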
I think I know what happened here - it's possibly related to T223180.
Hi @Krinkle, the metrics for php7 already exist; they're exported to Prometheus as follows:
Wed, May 15
I changed the title of the task to reflect my findings, and changed the associated tags accordingly.
Hi, I've tested a few combinations of errors, and the only case where this happens is when you choose action=segfault.
A net effect of this patch is that all logged-in users are now back on HHVM. I think we need to backport the patch above to the running versions of MediaWiki, so that we also sample logged-in users (who constitute a sizeable part of backend requests anyway).
It's working in production because we connect to external URIs via a proxy, hence we don't need ca-certificates.
Tue, May 14
EtcdConfig in MediaWiki has been extensively tested against failures before it was introduced, if that's what @jcrespo was referring to.
Mon, May 13
Should we backport this to wmf.4? This is blocking further deployment of PHP7.
Fri, May 10
Going deeper, this template looks up data in this and other tables:
And indeed, this page I just created containing only the use of that template:
from the original request's parsing report:
Just did some tests in production. The aforementioned page will render fine if I raise the memory limit to 2 GB. It also takes 25 seconds to render vs the 5 seconds it takes with HHVM.
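For the record, the comparison above was done roughly as sketched below. This is only an illustration of the measurement; how the parse is triggered and the exact limit override used in production are not shown:

```
<?php
// Sketch of the memory-limit comparison: raise the limit, parse the page, time it.
ini_set( 'memory_limit', '2048M' ); // well above the normal per-request limit

$start = microtime( true );
// ... trigger the parse of the problematic page here (e.g. via the MediaWiki
// parser or an internal request); omitted for brevity ...
$elapsed = microtime( true ) - $start;

printf( "Parse took %.1f seconds, peak memory %.0f MB\n",
    $elapsed, memory_get_peak_usage( true ) / 1024 / 1024 );
```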
Participant selection should be based on need, not on a fixed quota of participants.
Given this is the Wikimedia tech conference, I wouldn't consider it central to work on defining the relationship between the Wikimedia movement and third-party users of Wikimedia products. I'm not against having it as a theme, but it's by far not the first thing I would talk about.
Thu, May 9
The way to go for such things is to use role::beta::docker_services on a fresh VM.
The only caveat is that apparently you need to run puppet once, then run apt-get update, then run puppet again to make it work.
Ok so I can now confirm:
I don't really know why stretch wouldn't work; are we sure that's the case?
Please also note that you can run multiple services on the same VM if you really want to; it's enough to add a second stanza to the hiera definition.
There is a simple way to run the services that are now on k8s in deployment-prep:
So what happened is that the server ran out of available memory and OOM'd.
Wed, May 8
We have done some work to overcome these sporadic failures:
Tue, May 7
@Krinkle should we raise the memory limit slightly? I can do some tests, but apart from that what remains to be done?
Hi, this is considered a blocker for further deployment of php7. @Krinkle do you think the patch you merged tonight solves the issue and can unblock further deployments?
I would even suggest that, if we write a puppet-lint plugin for this, we add the fix capability; that should allow a relatively quick removal of all hiera() calls.
Mon, May 6
So a couple things:
Fri, May 3
Some things from my very initial analysis:
- I first tried to purge the directory that the deployment had invalidated; the error didn't go away
- I then tried purging composer's autoload file, which was supposedly loading the library from the wrong place; still no dice
- I finally tried purging all of the opcache, which solved the problem (see the sketch below)
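For anyone hitting the same issue, the difference between the targeted purge and the full purge is roughly the following. This is a sketch with an illustrative path, not the exact commands run in production:

```
<?php
// Sketch of the two purge strategies described above.

// 1. Targeted purge: invalidate the cached bytecode of specific files, e.g.
//    everything under the directory the deployment touched (path illustrative).
//    This was not enough to clear the stale state.
$dir = '/srv/mediawiki/wmf-config'; // illustrative path
$it = new RecursiveIteratorIterator( new RecursiveDirectoryIterator( $dir ) );
foreach ( $it as $file ) {
    if ( $file->isFile() && $file->getExtension() === 'php' ) {
        opcache_invalidate( $file->getPathname(), true );
    }
}

// 2. Full purge: drop the whole opcache. This is what finally solved the
//    problem, at the cost of recompiling everything afterwards.
opcache_reset();
```

The full reset drops all compiled state at once, which is presumably why it also cleared whatever stale state the per-file invalidation missed.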
Thu, May 2
Mon, Apr 29
Mon, Apr 22
@kchapman regarding point 1 above - I've prepared various patches, including https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/505487 that should act as a stopgap solution for now.
Apr 20 2019
I realized I failed to update this ticket with my investigation:
FWIW, the doc_root being set was causing severe issues under php7.2. I removed it from the list of ini settings we want to use, and from the actual config as well.
Apr 19 2019
Let's try to break down the procedures for:
- Upgrading an existing table for a running application
- Creating a new table for a new application
- Altering the configuration of a table for a running application
I fixed the configuration of cpjobqueue in deployment-prep, restarted the service, and verified requests are not getting through to the jobrunner:
FWIW, I don't think we need the TLS configuration in beta. I can try to simplify things. Sorry for not noticing this bug earlier, but adding Operations or, better, serviceops would have helped bring it to my attention.
Apr 18 2019
@Elitre the problem seems to be a regression in MediaWiki, tracked at T221368. @Reedy reverted the group 1 wikis (including meta) to the previous version and now the messages can be re-sent. I don't think there should be duplicates, you can quote me on that and blame me for the annoyance :)
We decided to revert given the spike in errors that started yesterday at 19:15 UTC; it corresponds to the SAL entry for moving group 1 to wmf.1.
@Reedy graciously reverted group 1 for me, as this was the cause for a UBN! ticket.
So the CA public cert will expire as well at the end of May.