Tue, Jun 25
Prometheus Blazegraph exporter updated, we should be good now.
We could define the GUI module in a profile and disable that profile as needed (-P !gui). Some ideas: https://stackoverflow.com/questions/13381179/using-profiles-to-control-which-maven-modules-are-built
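A minimal sketch of what that could look like, assuming the GUI code lives in a module named `gui` (the module and profile names here are illustrative, not from the actual pom):

```
<!-- parent pom.xml: the gui module is only built when the profile is active -->
<profiles>
  <profile>
    <id>gui</id>
    <!-- active by default, so a plain `mvn package` still builds everything -->
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <modules>
      <module>gui</module>
    </modules>
  </profile>
</profiles>
```

Headless builds would then skip it with `mvn package -P '!gui'` (quoted so the shell doesn't interpret the `!`).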
Mon, Jun 24
no further issues seen, let's get this closed.
Mon, Jun 17
Mjolnir's workload is to transfer updates to the elasticsearch cluster, which happens weekly, so it is expected to see no updates for part of the week. The revised check we deployed looks at the ratio of errors, but does not guard against division by zero.
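For reference, a hedged sketch of the guard the check needs (hypothetical names, not the actual mjolnir check code):

```
def error_ratio(errors: int, total: int) -> float:
    """Hypothetical sketch: ratio of failed updates to total updates.

    mjolnir only pushes updates weekly, so total can legitimately be 0
    for part of the week; treat that as "no errors" instead of letting
    the check die with a ZeroDivisionError.
    """
    if total == 0:
        return 0.0
    return errors / total
```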
Tue, Jun 11
@Cmjohnson elastic1029 is shut down and downtimed in Icinga; do whatever you need to do and restart it whenever you're done.
@Cmjohnson any news on this? Do you need anything from our side?
Fri, Jun 7
Thu, Jun 6
Looking around on maps2002, I see an invalid apt source list (P8595) during the late_command:
Tue, Jun 4
For context: the maps servers have 2x900GB + 2x1.5TB disks. At the moment we are using RAID10 across all those disks, so we're wasting a bunch of space. We could do better with RAID1 on the same-size disk pairs and LVM across those arrays, as sketched below.
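Roughly, and with hypothetical device names, the layout would look like this (a sketch, not the exact partman/installer config):

```
# One RAID1 per matched pair of disks (device names are illustrative):
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb  # 2x900GB -> ~900GB usable
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd  # 2x1.5TB -> ~1.5TB usable
# Pool both arrays into a single LVM volume group:
pvcreate /dev/md0 /dev/md1
vgcreate maps-vg /dev/md0 /dev/md1
lvcreate -l 100%FREE -n data maps-vg
```

That would give roughly 2.4TB of mirrored space, versus the ~1.8TB RAID10 gets when it is limited by the smaller disks.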
Mon, Jun 3
Tue, May 28
May 27 2019
For whatever reason, only maps1004 was reimaged to RAID10 (instead of RAID1) when the new disks were added (so we have 2 unused disks in each server). Note that since we have disks of different sizes, RAID10 still wastes quite a bit of space; we should probably have RAID1 over each pair of physical disks and use LVM to spread the partition over those two RAID1 arrays.
Previous instance of a similar problem: T194966
May 22 2019
May 21 2019
Duplicate of T223519
May 20 2019
Error reset as documented in Monitoring/Memory.
May 7 2019
May 6 2019
Executive summary: we should have enough capacity for next year.
May 3 2019
Actually, the check timed out, which makes sense if it was routed to the problematic server before it was marked invalid. This is expected, so nothing to do here.
It looks like we need to investigate this a bit more
May 2 2019
The use case currently being run is actually the cirrus dumps to initialize the cloudelastic servers. They are downloaded on mwmaint1002 with curl -s https://dumps.wikimedia.org/other/cirrussearch/20190429/enwiki-20190429-cirrussearch-general.json.gz
May 1 2019
Apr 30 2019
After a few days, the load looks good and smoother than before. Let's close this!
Apr 26 2019
Apr 25 2019
Apr 23 2019
Yes, this is the continuation of T206636.
Apr 18 2019
I had a conversation with @hashar about this topic. So here are a few ideas:
Data transfer completed with the new cookbook, everything seems fine.
Apr 16 2019
The Stretch migration is complete. This should be fixed; we'll reopen if it happens again.
redundant units have been cleaned via cumin:
Apr 15 2019
Deployment seems to be a noop:
permissions reset via:
Removing maps from this ticket, since there isn't any work left on our side.
Apr 12 2019
Apr 11 2019
I don't think there is anything actionable at this point. Let's close.
Apr 10 2019
Open firewall on cloudelastic machines to allow connections from mwmaint*, mw job runners and cloudelastic
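In puppet that would presumably be a ferm rule along these lines (a sketch only; the resource name, port and source range are assumptions, not the actual change):

```
ferm::service { 'cloudelastic-http':
    proto  => 'tcp',
    port   => '9200',                    # assuming the elasticsearch HTTP port
    srange => '$MW_APPSERVER_NETWORKS',  # plus mwmaint* and the cloudelastic hosts themselves
}
```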
Apr 9 2019
The reimage was problematic, with first a puppet failure and then the server not booting over PXE. Manually booting into PXE (F12) finally fixed the issue.