Wed, Oct 2
Host is shut down and has networking issues. As such, the role(system::spare) was not applied and it is still in PuppetDB. Since that cleanup step is part of the "uninterruptible" steps, it has not been performed yet.
Tue, Oct 1
The MIME types used are:
Mon, Sep 30
Mon, Sep 23
Steps for decommission of elastic1017:
Thu, Sep 19
The errors themselves are expected behaviour. Subtasks have been created to track improvements we can make.
Wed, Sep 18
Tried again; it seems that sda has issues (see log below). Did the second disk also fail? Or was the wrong disk replaced? Or something else?
Sep 12 2019
It's probably good to keep that task, but the current deployment on beta is really a hacked together proof of concept, and we do expect a lot of things to go wrong. We need to do a lot of cleanup (T232297) first.
Note that T222497 needs to be resolved before we can actually have a working dump.
As discussed on IRC, I can vouch for @Igorkim78, he is a contractor with us and already has access to the wikidata-query project on WMCS. I've granted access to deployment-prep.
Sep 5 2019
Sep 4 2019
Password reset completed. The password configuration is duplicated between master and slaves and was not updated for the slaves. This should be back to normal. @MSantos: can you confirm and close if all looks good?
Sep 3 2019
Sep 1 2019
As far as I can tell, the issue is now resolved.
Aug 30 2019
OSM replication was triggered manually and tile generation started for that area (follow the generation on grafana). If that's enough, we'll just need to invalidate the cache for that area, or wait 24h for cache invalidation. If that does not fix the issue, we'll need to investigate further. Either we have corrupted data in some way (on both eqiad and codfw, which seems unlikely) or we have a style issue.
Aug 29 2019
Aug 27 2019
Needs to be documented on https://phabricator.wikimedia.org/project/view/1227/
@Cmjohnson: it looks like the installer only sees a single disk, and thus can't partition. Could you check? Thanks!
Aug 23 2019
@Cmjohnson thanks! I'll take it over and reimage
Aug 22 2019
Aug 20 2019
Nice! I did not know that one.
Aug 19 2019
Aug 15 2019
@Cmjohnson don't spend more time on it; it is scheduled for replacement, which should arrive August 21. We can live without this server for two weeks.
Aug 14 2019
Aug 13 2019
eventgate eqiad was depooled from 10:30 UTC to 12:20 UTC, which matches the window during which no updates were applied.
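As a quick sanity check on the times above, the depool window works out to 1h50m (the date is assumed here for illustration):

```python
from datetime import datetime

# Depool window mentioned above: 10:30 UTC to 12:20 UTC
# (the 2019-08-13 date is an assumption, taken from this entry's header).
depool_start = datetime.fromisoformat("2019-08-13T10:30:00")
depool_end = datetime.fromisoformat("2019-08-13T12:20:00")
window = depool_end - depool_start
print(window)  # 1:50:00
```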
Aug 12 2019
elastic[1032-1052].eqiad.wmnet, elastic[2025-2036].codfw.wmnet have been configured with set /system1/oemhp_power1 oemhp_powerreg=os. This will take effect after next rolling restart.
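For reference, the host lists above use cumin-style range syntax. A rough sketch of how those ranges expand (the `expand_range` helper below is hypothetical, not the actual cumin/ClusterShell parser):

```python
import re

def expand_range(spec):
    """Expand a cumin-style host range such as 'elastic[1032-1052].eqiad.wmnet'.
    Hypothetical helper for illustration only; not the real cumin grammar."""
    m = re.match(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if not m:
        return [spec]
    prefix, lo, hi, suffix = m.groups()
    width = len(lo)  # preserve zero-padding of the lower bound
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(lo), int(hi) + 1)]

hosts = expand_range("elastic[1032-1052].eqiad.wmnet") + \
        expand_range("elastic[2025-2036].codfw.wmnet")
print(len(hosts))  # 21 eqiad + 12 codfw = 33 hosts
```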
Aug 8 2019
A few more comments after discussion with @elukey :
Aug 6 2019
At the moment, we have a ferm rule to allow access to port 8888 from $DOMAIN_NETWORKS. I think this should be sufficient, but I'm always somewhat lost in our network setup.
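For context, a rule of that shape is typically declared in puppet along these lines (a sketch only; the resource title and the module where it lives are made up here, not the actual manifest):

```puppet
# Sketch of a ferm::service declaration allowing TCP port 8888
# from internal networks. $DOMAIN_NETWORKS is resolved by ferm itself,
# hence the single quotes.
ferm::service { 'example_service_http':
    proto  => 'tcp',
    port   => '8888',
    srange => '$DOMAIN_NETWORKS',
}
```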
Aug 5 2019
Jul 19 2019
Rough back-of-the-envelope calculation of the cost of staying on RAID1 is on T227755#5349525. Since it contains pricing, I'm keeping it on the procurement task, which is private.
I've just updated the task description to make it clear that even if we move storage to RAID0, we'll keep the OS on RAID1 (same scheme used by elasticsearch servers).