Tue, Jun 11
Just as an FYI, everything looks ok on this end, but there's a train freeze this week, so we have to wait before deploying this. Patches are up and waiting to be merged on Monday the 17th.
Fri, Jun 7
sessionstore.discovery.wmnet now exists and should be the canonical DNS name used to address the service.
Indeed, this was fixed. However, another regression has reared its head.
Thu, Jun 6
Wed, Jun 5
Tue, Jun 4
And LVS done today.
Patch has been applied to our own packages and has been deployed and tested. Marking this as resolved, thanks!
Fixed by upstream in https://github.com/OTRS/otrs/commit/7ab33e51a4db9f712e979040f644d0d0c39ff0af for 5.x (which is what we run). It has also been fixed in our OTRS package in https://gerrit.wikimedia.org/r/#/c/operations/software/otrs/+/514230
Mon, Jun 3
Sat, Jun 1
Fri, May 31
One minor question: given that, per T220401#5128786, one kask instance is able to handle ~300 req/s, how many instances will we require? I am unsure of the current rate of session requests to/from redis.
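The sizing question above is simple arithmetic once the session request rate is known. A minimal sketch, assuming a hypothetical current rate (the real rate would need to be measured from redis) and the ~300 req/s per-instance figure from T220401#5128786:

```python
import math

PER_INSTANCE_RPS = 300  # measured per-instance capacity, per T220401#5128786

def instances_needed(session_rps: float, headroom: float = 0.5) -> int:
    """Instances required to serve session_rps with spare capacity.

    headroom=0.5 provisions 50% extra, so each instance runs at
    roughly two-thirds of its measured limit.
    """
    return math.ceil(session_rps * (1 + headroom) / PER_INSTANCE_RPS)

# Hypothetical example: if redis currently sees ~1200 session req/s,
# we'd want 6 instances with 50% headroom.
print(instances_needed(1200))  # → 6
```

The headroom factor here is an illustrative choice, not a project policy; it just makes the point that the answer should not be sized to run instances at 100% of the measured limit.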
I think so, let's wait for @fsero though
And this uncovered that prometheus now can't talk to it (because it expects plain HTTP, I guess?). /me looking into it (more deeply this time around).
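If the cause is indeed Prometheus scraping over plain HTTP, the fix would be to switch the scrape job to TLS. A hedged sketch (the job name, port, and CA path are assumptions, not the actual prod config):

```yaml
scrape_configs:
  - job_name: kask            # hypothetical job name
    scheme: https             # kask now only serves TLS
    tls_config:
      # assumption: kask's cert is signed by an internal CA
      ca_file: /etc/ssl/certs/internal-ca.pem
    static_configs:
      - targets: ['sessionstore.discovery.wmnet:8081']  # port is an assumption
```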
One thing I just ran into is that kask stops accepting plain-HTTP connections once a cert/key pair is configured. That's fine normally, but there is a very interesting repercussion: the Kubernetes readiness probes to the /healthz endpoint now fail. kask logs
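One likely remedy, sketched below: Kubernetes `httpGet` probes default to HTTP, but support `scheme: HTTPS` (and the kubelet skips certificate verification for HTTPS probes, so a self-signed or internal-CA cert is fine). The port number here is an assumption:

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8081        # assumed kask listening port
    scheme: HTTPS     # probe over TLS instead of the HTTP default
  initialDelaySeconds: 5
  periodSeconds: 10
```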
We might be able to get away with reusing the redis misc servers (rdb1005/rdb1009). That should give us more memory, allow us to use the empty databases on those hosts, and let us deprecate/remove these VMs.
Thu, May 30
Indeed. I added it under TBD. The exact way this will be done will need to be investigated.
Wed, May 29
https://wikitech.wikimedia.org/wiki/Ganeti#VMs_without_DRBD_disk_template has been added to address the drawback that needed to be communicated and documented.
Tue, May 28
Hidden and marked as security by upstream.
Upstream bug at https://bugs.otrs.org/show_bug.cgi?id=14568
I just reproduced it. This is related to a sandboxed <iframe> that is embedded in the page to showcase the content of the email in a safe way. That being said, leaking session info in the URL is indeed bad practice, given all the copy-pasting that can happen. In fact, just adding a valid OTRSAgentInterface to any OTRS URL seems to be enough to assume the OTRS identity of a user. On the plus side, those aren't guessable at least. The fun part is that
We've upgraded to 5.0.35. Resolving this. Thanks!
Mon, May 27
That would split the configuration even worse, though. Instead of having it at least in different but related parts of the same infrastructure, it would now be spread across different infrastructures (MediaWiki vs email).
Wed, May 22
I am lowering this to High, in the interest of not abusing Unbreak Now!, since this task has been in that state since Apr 23. That being said, this indeed needs to be resolved ASAP.
Sun, May 19
May 17 2019
May 16 2019
needs an extra stanza
Every deployment in Kubernetes that uses statsd-exporter (notably, zotero & blubberoid don't) has been upgraded. Resolving this. Many thanks!
May 15 2019
May 14 2019
With respect to the endpoint checks, it would be great to hear what we are trying to achieve with them. Our service depends on the availability of another service. If the examples are to act as smoke tests, then their reliability depends on the upstream service: a dependency which would need to be both configured (are we going to point it at prod for this?) and modeled (how do we express service inter-dependency in the config?) in order to make sense of the information down the line (i.e. "no need to be alarmed that this service reported 500 while the mw api was down").
May 13 2019
Bacula & puppet databases are not going to exhibit any problems anyway. The puppet database is literally used only by servermon, which is to be uninstalled pretty soon, and backups don't happen during that time window.
etherpad, given the software, is a best-effort service, so no guarantees there. It will probably crash anyway, be restarted by systemd (as it already does every couple of days), and users will be reconnected.
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/377269/ had fallen through the cracks. It's now merged, right before a SWAT window, in order to identify any issues as fast as possible.