Thu, Jul 12
Change deployed. apertium-fra-cat and apertium-separable are updated, and apertium-apy has gone through a rolling restart.
Wed, Jul 11
grafana-admin.wikimedia.org now redirects to grafana.wikimedia.org (preserving the URL structure) in order to migrate the last few users to it.
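For anyone wondering what "preserving the URL structure" means in practice, here is a minimal sketch of the mapping, not the actual redirect configuration that was deployed. The function name `redirect_target` and the dashboard path in the usage note are hypothetical and only there for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "grafana-admin.wikimedia.org"
NEW_HOST = "grafana.wikimedia.org"


def redirect_target(url: str) -> Optional[str]:
    """Return the URL on NEW_HOST that a request for `url` would be sent to.

    Only the host is swapped; scheme, path, query string and fragment are
    kept as they were. Returns None when `url` is not on OLD_HOST.
    (Hypothetical helper, for illustration only.)
    """
    parts = urlsplit(url)
    if parts.hostname != OLD_HOST:
        return None
    return urlunsplit(
        (parts.scheme, NEW_HOST, parts.path, parts.query, parts.fragment)
    )
```

With this, an old bookmark such as `https://grafana-admin.wikimedia.org/dashboard/db/example?orgId=1` (a made-up path) would map to the same path and query string on grafana.wikimedia.org.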
Mon, Jul 9
FWIW, servermon allows searching for specific facts (for specific hosts as well) under the fact query menu, which makes it relatively easy to slice the information present in puppetdb (actually in the mysql activerecord db, but this is unimportant). That is not something I see as possible in puppetboard. There is a query tab but I get
Fri, Jul 6
Key updated, should make it to the cluster in the next 30 mins. I am resolving this, feel free to reopen/reach out if there's a problem.
I am setting this to Stalled and low priority, pending @Braveheart's response.
I am adding the cloud services team for the 2FA reset and removing the SRE-access-request tag per @Krenair's comment above.
Thu, Jul 5
FWIW, +1 from me.
Tue, Jul 3
Mon, Jul 2
This was completed successfully a few weeks ago.
Fri, Jun 29
Wed, Jun 27
Things are definitely going way better now. I only see 1 alert in the last 24 hours.
Mon, Jun 25
Lowering priority to reflect that we have increased the CPU count quite a bit, but the task is not yet resolved.
Agreed. While the proposed solution is probably the best overall, I went ahead with option A alone (increase the vCPU count to 10) for now, in order to move forward with this without blocking it on adding more machines to the cluster. This gives us a total worker count of 20 per DC, which is pretty close to the proposed one, and I am hopeful this will resolve the issues experienced. I'll keep the task open, however, in order to implement the proposed approach later on.
Jun 16 2018
Jun 15 2018
I can probably help indeed. I presently have no idea what's up with labs1006, but labsdb1007 is aliased as osmdb.eqiad.wmnet, which is used directly by the maps labs project. I've added @Kolossos and @dschwen as 2 of the people I know use this infrastructure. As far as I know, a lot of map tiles are pregenerated and at least some functionality will not be directly impacted by a downtime of the service. I am willing to bet there will be functionality that is impacted, however; @Kolossos and @dschwen can probably shed more light on this.
Jun 13 2018
Jun 12 2018
I've stalled adding LVS configuration for proton due to an instability we've been noticing. This instability is very obvious if one looks at the alert history of each of the 4 services (there are 4 hosts powering the service right now).
Yeah, found something new. I've reblocked some stuff and I'll update P7249. Things do look normal again, though this might just be a game of whack-a-mole.
See P7249 for the list of IPs.
I've banned a specific IP (I'll share it in a private paste later on) and restarted apache; everything seems to be OK now.
Jun 11 2018
All of my tests went fine. Scheduling this for Wednesday, June 27th. I'll send an email to wikitech-l as well.