Fri, Sep 13
We have a first core dump on mw1348 - I moved it under /root/T232613; if you need access, ping me on IRC.
It's not true that "important services are monitored via dedicated service-specific checks" - quite the contrary on a lot of systems. I would rather improve the systemd alert than silence it, and maybe finally be done with those hacky checks for the number of running processes.
Thu, Sep 12
mw1347 and mw1348 receive more traffic than the rest of the PHP API servers, so it makes sense that this happens more frequently there.
Hi, we had some connectivity issues earlier. As soon as we were alerted and started investigating, the issues recovered. We suspect the root cause was network maintenance that was ongoing at the time; the problem is now resolved.
Wed, Sep 11
This is now 99% done. We just need a confctl release for scap pull to work as intended. Resolving this task, though, as the work on the mw deployment side has been completed.
I'm not sure about php7 - it was fixed when I got around to fixing this machine.
While I support the use of this patch, the problem you're seeing should be greatly mitigated when we start using a middleware to manage service-to-service RPC. For now that's still in its infancy, but we already use that approach for cirrussearch, where requests are proxied via a local nginx on each appserver.
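To illustrate the local-proxy approach mentioned above: the pattern is that applications talk to a proxy on localhost, which handles connection pooling and timeouts toward the remote service. This is a purely hypothetical sketch (upstream name, port, and timeout values are invented, not the actual cirrussearch production config):

```nginx
# Hypothetical sketch of a local proxy for service-to-service RPC.
# Upstream name, port and timeouts are invented for illustration only.
upstream search_backend {
    server search.svc.eqiad.wmnet:9243;
    keepalive 16;          # reuse TCP connections to the backend
}

server {
    # Applications talk to localhost; the proxy handles pooling,
    # retries and connection reuse toward the remote service.
    listen 127.0.0.1:6102;

    location / {
        proxy_pass https://search_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # required for keepalive upstreams
        proxy_connect_timeout 1s;
        proxy_read_timeout 5s;
    }
}
```

The point of the middleware is that timeouts, retries, and connection reuse are configured once per service instead of in every client.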
Hi @ssastry, just a clarification: how would we load the Parsoid code if it can't be merged into the wmf vendor repository? The same way we do on scandium?
Sat, Sep 7
Fri, Sep 6
Thu, Sep 5
So the real issue was:
- termbox correctly uses the api-ro.discovery.wmnet host
- the discovery record was incorrectly set to active-active
- so requests from termbox would just go to the nearest dc, meaning that in codfw it would face super-cold caches after every release
- as a consequence, some requests would time out because of the cold caches at all levels
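For the record, flipping a discovery record like this from active-active to active-passive is done with conftool. A sketch of what the fix would look like - the selector values here are illustrative, not a transcript of the actual commands run:

```shell
# Sketch (illustrative selectors): depool the codfw side of the
# api-ro discovery record so all traffic goes to the primary DC.
confctl --object-type discovery select 'dnsdisc=api-ro,name=codfw' set/pooled=false

# Verify the state of both sides of the record afterwards:
confctl --object-type discovery select 'dnsdisc=api-ro' get
```

With only one side pooled, termbox requests no longer land on the cold codfw caches after a release.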
I did some tests, and we still have one problem with scap pull:
Wed, Sep 4
@Papaul that looks fine - I don't think we need to swap out the SSDs, so just do it if we have a better use for those disks (they're pretty useless on an appserver).
@thcipriani should we create a new package/release?
Tue, Sep 3
I second what @MoritzMuehlenhoff suggested. The system is not scheduled for replacement for another 2 years, so if we can salvage it somehow, that'd be great.
Mon, Sep 2
A couple of comments:
- I concur with @Volans: I'd keep the first iteration (at least) *very* simple
- I know adding tags to a schema is a pain (in fact, it will need a data migration), but the flavour thing you were proposing seems like the kind of thing that should be a tag; so scope=eqiad,flavour=main could be a set of tags for an instance object
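To make the tag idea concrete, an instance object could carry tags as free-form key/value pairs rather than dedicated schema fields. A hypothetical sketch (the field names are invented, not the actual schema):

```yaml
# Hypothetical sketch: tags as key/value pairs on an instance object.
# Field names and the instance name are invented for illustration.
instance: mw1348
tags:
  scope: eqiad
  flavour: main
```

The upside is that adding a new dimension later means adding a tag, not another schema migration.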
Tue, Aug 27
Mon, Aug 26
For context, the actual time to run the tests for operations/puppet is under one minute for most patches.
I did some cleanup, removing non-SREs and adding a few people from the US TZ.
@Mathew.onipe can I ask for further details about the error you get? It should definitely not be an issue if the test works locally in a Docker image.
@leila I just stumbled upon this task, and besides being happy that the patch was merged, I'm asking myself:
A file of 473 MB surely exceeds the large-file limits, unless something changed recently.
BTW I see the patch is still under review, and @Volans is on it.
Fri, Aug 23
First smoking gun: in all the intervals I checked, the offender was parsoid-batch with quite large requests. I'm trying to gather more cases to build better statistics.
I think it's good to have a first, simple implementation like the one above, but going forward we will need a "block" object in puppet (or elsewhere, more on that below) that includes:
Thu, Aug 22
@WDoranWMF we will get to this as soon as our resources allow it.
I think this is a reasonable explanation, but how would you suggest we should fix our monitoring?
It's indeed strange. In particular, I find it strange that it affects mainly 400s and 404s. Maybe the Performance-Team might have insight into why 4xx and 301s are so slow after a deploy of the train (where I assume a ton of caches, local and not, are invalidated all at once).
@Reedy @JBennett I've set you up as list administrators. You now need to change the list admin password; I'm happy to help if you can't reset it yourselves (I don't think you can). Just ping me on IRC so we can change the password in sync.
This seems to be a consequence of yesterday's Phabricator upgrade. SRE is trying to reach someone in Release Engineering to help with the rollback.
Wed, Aug 21
Indeed! We're doing more than this!
Tue, Aug 20
You can automate the process with a simple alias:
Mon, Aug 19
Sorry, the indications you give here contradict each other:
Aug 14 2019
T214984 seems somewhat related.
Aug 7 2019
My main worry is that anything you could do would be wiped out by the next scap run, unless we find a way to inject the code into mediawiki in a way that avoids that.
I like the idea of the ParserCache becoming a more generalized caching mechanism for MediaWiki. I have serious doubts about other things hinted at here, specifically exposing a caching endpoint to other services. I'd argue that such a caching service should be separate from MediaWiki, have a simple API, and probably be structured around the page/revision identifier. We also probably don't want such a system to be written in PHP, since we would aim for the highest possible throughput.
Aug 2 2019
I would rather do what @Anomie suggested, that is, using PrivateTmp=true for php-fpm. I'll look into it.
Aug 1 2019
I tested reimaging one application server; it went flawlessly, and the server is now running without any trace of HHVM. I'll resolve this ticket.
Jul 29 2019
@Aklapper the RfC has been edited to reflect what's on phabricator at https://www.mediawiki.org/wiki/Requests_for_comment/Standards_for_external_services
I think this idea predates the Thumbor project; the goal was to have a thumbnail service that didn't need access to the databases, for instance.
Jul 25 2019
Given that Parsoid/PHP is intended to be used as a library by a MediaWiki extension, I think it should be included in the code we release with scap, with the extension activated only on scandium for the time being.