Fri, Sep 20
and yes, if you need to have a service not start when the package is installed, you need a systemd::mask definition in puppet.
@hashar I guess the CI servers should have more relaxed thresholds? Is it even possible to configure gerrit to whitelist some host?
Thu, Sep 19
Just to clarify - this wasn't an attempt to imagine a future, perfect system, but just a small MVP that could possibly work on our current infrastructure.
One important detail:
Wed, Sep 18
just to note that when we remove the HHVM tag we might also want to remove the PHP7 one, given that the latter migration is almost over as well :P
Tue, Sep 17
@ori I'm not 100% sure I got what information you think would be useful to extract. At first glance it would seem like collecting those data in a structured manner on logstash would be useful, but the ticket seems to suggest building a specialized interface.
Mon, Sep 16
I see no further occurrences of the bug in logstash either for mw1347 - not that I had many doubts at this point.
Sun, Sep 15
As can be seen on logstash, as @Daimona mentioned, errors suddenly stopped after I disabled the interned strings buffer. While it's too early to fully evaluate any impact on performance, I would say that the effect is clearly not very penalizing. See the 95th percentile for successful responses for mw1348.
Fri, Sep 13
We have a first core dump on mw1348 - I moved it under /root/T232613, if you need access ping me on IRC.
It's not true that "important services are monitored via dedicated service specific checks"; quite the contrary on a lot of systems. I would rather improve the systemd alert instead of silencing it, and maybe finally be done with using those hacky checks for the number of running processes.
Thu, Sep 12
mw1347 and mw1348 receive more traffic than the rest of the php api servers, so it makes sense this happens more frequently there.
Hi, we had some connectivity issues earlier. As soon as we were alerted and started checking, the issues recovered. We suspect the root cause to be a network maintenance ongoing at the time, but the problem is now resolved.
Wed, Sep 11
This is now 99% done. We just need a confctl release to be able to make scap pull work as intended. Resolving this task though, as the work on the mw deployment side has been completed.
I'm not sure about php7 - it was fixed when I got around to fixing this machine.
While I support the use of this patch, the problem you're seeing should be greatly mitigated when we start using a middleware to manage service-to-service RPC. For now that's still in its infancy, but we already use that approach for cirrussearch, where requests are proxied via a local nginx on each appserver.
hi @ssastry just a clarification: how would we load the parsoid code, if it can't be merged in the wmf vendor repository? Same way we do on scandium?
Sat, Sep 7
Fri, Sep 6
Thu, Sep 5
So the real issue was:
- termbox correctly uses the api-ro.discovery.wmnet host
- the discovery record was incorrectly set to active-active
- so requests from termbox would just go to the nearest dc, meaning that in codfw it would face super-cold caches after every release
- as a consequence, some requests would time out because of the cold caches at all levels
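To make the difference concrete, here is a toy model of the routing decision. It is only a sketch: `resolve_dc` and its parameters are hypothetical names, and it does not reflect how the discovery DNS is actually implemented.

```python
def resolve_dc(client_dc, pooled_dcs, primary_dc, active_active):
    """Toy model: pick the datacenter that serves a request.

    All names here are illustrative assumptions, not real discovery code.
    """
    # Active-active: a client is sent to its own (nearest) DC if it is pooled.
    if active_active and client_dc in pooled_dcs:
        return client_dc
    # Active-passive: everything goes to the primary DC, which has warm caches.
    return primary_dc


pooled = {"eqiad", "codfw"}

# With the record wrongly set to active-active, a codfw client stays in codfw.
wrong = resolve_dc("codfw", pooled, "eqiad", active_active=True)

# With an active-passive record, the same client is routed to the primary DC.
fixed = resolve_dc("codfw", pooled, "eqiad", active_active=False)
```

In this model the only thing that changes between the broken and the fixed behaviour is the `active_active` flag on the record; the client keeps using the same hostname.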
I did some tests, and we still have one problem with scap pull:
Wed, Sep 4
@Papaul that looks fine - I don't think we need to swap out the SSDs, so just do it if we have a better use for those disks (they're pretty useless on an appserver).
@thcipriani should we create a new package/release?
Tue, Sep 3
I second what @MoritzMuehlenhoff suggested. The system is not scheduled for replacement for another 2 years, so if we can salvage it somehow, that'd be great.
Mon, Sep 2
A couple comments:
- I concur with @Volans: I'd keep the first iteration (at least) *very* simple
- I know adding tags to a schema is a pain (in fact, it will need a data migration) but the flavour thing you were proposing seems like the kind of thing that should be a tag, so scope=eqiad,flavour=main could be a set of tags for an instance object
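A rough sketch of the tag idea, assuming an instance object that carries a free-form set of key=value tags. The `Instance` class, `parse_tags` and `matches` are hypothetical names for illustration, not part of any existing schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Instance:
    """Hypothetical instance object; field names are illustrative only."""
    name: str
    tags: Dict[str, str] = field(default_factory=dict)

    def matches(self, **selector: str) -> bool:
        """True if every key=value pair in the selector is among the tags."""
        return all(self.tags.get(k) == v for k, v in selector.items())


def parse_tags(spec: str) -> Dict[str, str]:
    """Parse a 'key=value,key=value' string into a dict of tags."""
    tags = {}
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        tags[key.strip()] = value.strip()
    return tags


instances = [
    Instance("a", parse_tags("scope=eqiad,flavour=main")),
    Instance("b", parse_tags("scope=codfw,flavour=main")),
]

# Select instances by tags rather than by a dedicated "flavour" column.
selected = [i.name for i in instances if i.matches(scope="eqiad", flavour="main")]
```

With this shape, adding a new dimension later only means adding another tag key, not another schema field.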
Tue, Aug 27
Mon, Aug 26
For context, the actual time to run the tests for operations/puppet is under one minute for most patches.
I did some cleanup removing non-sres and adding a few people from the US TZ.
@Mathew.onipe can I ask further details on the error you get? It should definitely not be an issue if the test works in a docker image locally.
@leila I just stumbled upon this task, and besides being happy that the patch was merged I'm asking myself:
A file of 473 MB surely goes over the large file limits unless something changed recently.
BTW I see the patch is still under review, and @Volans is on it.
Aug 23 2019
The first smoking gun: in all the intervals I checked, the offender was parsoid-batch with quite large requests. I'm trying to gather more cases to build better statistics.
I think it's good to have a first, simple implementation, like the one above, but I think going further we would need a "block" object in puppet (or elsewhere, more on that below) that includes:
Aug 22 2019
@WDoranWMF we will get to this as soon as our resources allow it.