Thu, Aug 15
@Cmjohnson don't spend more time on it; it is scheduled for replacement, and the replacement should arrive August 21. We can live without this server for 2 weeks.
Wed, Aug 14
Tue, Aug 13
eventgate eqiad was depooled from 10:30 UTC to 12:20 UTC, which matches the window during which no updates were applied.
Mon, Aug 12
elastic[1032-1052].eqiad.wmnet, elastic[2025-2036].codfw.wmnet have been configured with set /system1/oemhp_power1 oemhp_powerreg=os. This will take effect after the next rolling restart.
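For reference, a minimal sketch of how that setting could be applied in bulk over the management interfaces (the .mgmt hostnames and direct SSH access are assumptions for illustration, not verified against our setup):

```
#!/bin/bash
# Hypothetical sketch: push the iLO power regulator setting to a range of hosts.
# The .mgmt.eqiad.wmnet naming is an assumption; adjust to the real mgmt DNS.
for host in elastic10{32..52}.mgmt.eqiad.wmnet; do
  # The iLO exposes a SMASH CLP shell over SSH; the OS-controlled power
  # regulator only takes effect after the next reboot / rolling restart.
  ssh -o ConnectTimeout=5 root@"$host" \
    'set /system1/oemhp_power1 oemhp_powerreg=os'
done
```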
Thu, Aug 8
A few more comments after discussion with @elukey:
Tue, Aug 6
At the moment, we have a ferm rule allowing access to port 8888 from $DOMAIN_NETWORKS. I think this should be sufficient, but I'm always somewhat lost in our network setup.
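As a quick sanity check (a sketch; the hostname is a placeholder), reachability can be verified from a host inside $DOMAIN_NETWORKS, and the generated rule can be inspected on the server itself:

```
# Should connect from inside $DOMAIN_NETWORKS, and be filtered from outside
# (hostname is a placeholder):
nc -zv -w 3 some-host.eqiad.wmnet 8888
# On the server, check the iptables rules that ferm generated for port 8888:
sudo iptables -L -n | grep 8888
```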
Mon, Aug 5
Jul 19 2019
A rough back-of-the-envelope calculation of the cost of staying on RAID1 is in T227755#5349525. Since it contains pricing, I'm keeping it on the procurement task, which is private.
I've just updated the task description to make it clear that even if we move storage to RAID0, we'll keep the OS on RAID1 (the same scheme used by the elasticsearch servers).
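For illustration, a sketch of that scheme with mdadm (device and partition names are assumptions, not the actual layout):

```
# Hypothetical sketch: OS mirrored, storage striped.
# Partition names below are placeholders for illustration only.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2  # OS on RAID1
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sda3 /dev/sdb3  # data on RAID0
```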
Jul 18 2019
Jul 17 2019
Jul 15 2019
Oops, the 3 logs above about maps should have been on T218097
Jul 12 2019
copy completed, the updater has caught up
copy in progress from 2003 to 1004.
Jul 11 2019
The ES6 migration has been complete for some time
I observe a pretty significant drop in CPU usage on elastic1052 (>50% down to ~25%), so that looks good. I'll wait until Monday to apply it to the whole cluster.
Jul 10 2019
Spot-checking the cp* nodes that I see, they seem to be cache upload, which is the cache in front of Maps, not the cache in front of WDQS. This seems to point to Kartotherian not using X-Client-IP.
From what I see in Kibana:
This will not move forward until Q2 (October). We'll talk about it again at that time.
Any news on this? Can we do something to help this move forward?
Jul 9 2019
@thcipriani Thanks a lot for the detailed explanation!
Jul 8 2019
I ran into this issue again when deploying WDQS today. Some of the binaries were owned by the previous deployer. My workaround was to reset ownership to myself, but that's obviously not a step I would like to do every time we switch deployers.
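The workaround boils down to something like this (the deployment path is a placeholder, not the actual one):

```
# Hypothetical sketch of the ownership reset after a deployer switch;
# the path below is a placeholder.
sudo chown -R "$USER": /srv/deployment/wdqs/wdqs
```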
Jul 5 2019
elastic2054 is down again.
Jul 4 2019
All nodes reimaged, we're good for the moment
Jul 2 2019
Jul 1 2019
Jun 28 2019
This is scheduled to be done in Q1, so we can get started. As a reminder, some preliminary estimates were done in T222104 (not sure what can / should be reused).
Jun 25 2019
prometheus blazegraph exporter updated, we should be good now.
We could define the GUI module in a profile and disable that profile as needed (-P !gui). Some ideas: https://stackoverflow.com/questions/13381179/using-profiles-to-control-which-maven-modules-are-built
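For illustration, assuming the GUI module is wrapped in a profile named "gui" that is active by default (an assumption, not our current pom layout), the build invocation would look something like this:

```
# Build without the (hypothetical) "gui" profile; the leading ! deactivates it.
# Quote the argument: an unquoted ! triggers history expansion in interactive bash.
mvn clean package -P '!gui'
```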
Jun 24 2019
no further issues seen, let's get this closed.
Jun 17 2019
Mjolnir's workload is to transfer updates to the elasticsearch cluster, which happens weekly. So it is expected that there are no updates for part of the week. The revised check we deployed looks at a ratio of errors, but does not guard against division by zero.
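The fix amounts to guarding the denominator before computing the ratio; a minimal sketch (names and the threshold are hypothetical, not the deployed check):

```
#!/bin/bash
# Hypothetical sketch of a ratio-based check that tolerates idle periods.
errors=${1:?usage: check <errors> <total>}
total=${2:?usage: check <errors> <total>}
if [ "$total" -eq 0 ]; then
  echo "OK: no updates in this window (expected for part of the week)"
  exit 0
fi
ratio=$(( errors * 100 / total ))   # integer percentage is enough here
if [ "$ratio" -ge 10 ]; then        # 10% is a made-up threshold
  echo "CRITICAL: ${ratio}% of updates failed"
  exit 2
fi
echo "OK: ${ratio}% of updates failed"
```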
Jun 11 2019
@Cmjohnson elastic1029 is shut down and downtimed in icinga; do whatever you need to do and restart it whenever you're done.
@Cmjohnson any news on this? Do you need anything from our side?
Jun 7 2019
Jun 6 2019
Looking around on maps2002, I see an invalid apt sources list (P8595) during the late_command step:
Jun 4 2019
For context: the maps servers have 2x900GB + 2x1.5TB disks. We are at the moment using RAID10 across all four, so we're wasting a fair amount of space. We could do better with RAID1 on each same-size pair and LVM across the two arrays.
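For illustration, a sketch of the proposed layout (device names are assumptions; sizes come from the description above). With RAID10 the array is limited by the smallest disk (4 x 900GB / 2 = ~1.8TB usable), while RAID1 pairs give 900GB + 1.5TB = ~2.4TB usable:

```
# Hypothetical sketch: one RAID1 per same-size pair, LVM spanning both.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1  # 2x900GB pair
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1  # 2x1.5TB pair
pvcreate /dev/md0 /dev/md1
vgcreate maps-vg /dev/md0 /dev/md1      # one VG spanning both mirrors
lvcreate -l 100%FREE -n data maps-vg    # ~2.4TB usable vs ~1.8TB with RAID10
```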