Fri, Apr 19
Change merged, thanks!
This is almost done. The only thing missing seems to be the peering with the Juniper routers.
Before we move forward and enable this, let's make sure we have understood the security repercussions and have mitigated them (and if we find it impossible to do so, avoid it).
Thu, Apr 18
Just noting that at 10:41 UTC the circuit was still down per
Wed, Apr 17
Mon, Apr 15
Fri, Apr 12
Change merged and shepherded into production. I am lowering the priority but not resolving, as we probably want to evaluate this more.
I am thinking about excluding exec_sync operations from the checks for a while, to restore faith in the alerts.
A breakdown of the alerts per host from 2019-03-26 to 2019-04-12 follows.
Today (2019-04-12), I've raised the possibility that T220661 is related to the reason these alerts are flapping so much.
FYI, note the pod restarts below. Seems like the worker isn't ready within the timespan of the kubelet liveness probes (3x 10s => 30s), and eventually the pod is killed. Does it really take that long to initialize a new worker? We can definitely tune those numbers, but 30s for initializing a worker sounds like a lot.
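For reference, the "3x 10s => 30s" window corresponds to the probe's `failureThreshold` and `periodSeconds`. A hedged sketch of the kind of tuning meant here; the path and port are illustrative, not from the actual chart:

```yaml
# Hypothetical pod spec fragment. failureThreshold: 3 and periodSeconds: 10
# are the values that produce the 30s window before the pod is killed;
# raising initialDelaySeconds gives a slow-starting worker more headroom
# before the kubelet starts counting failures.
livenessProbe:
  httpGet:
    path: /_info        # illustrative health endpoint
    port: 8080          # illustrative port
  initialDelaySeconds: 30   # wait before the first probe
  periodSeconds: 10         # probe every 10s
  timeoutSeconds: 5
  failureThreshold: 3       # kill the pod after 3 consecutive failures
```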
Wed, Apr 10
dummy private repo updated, so is the actual private repo. Resolving, thanks!
Tue, Apr 9
FWIW the ganeti cluster uses exactly the approach outlined by @BBlack for this (among other, even more important) reasons:
Firejail seems a nice extra for robustness, but @akosiaris seemed to suggest in T217724#5008302 that that might introduce some overhead. AIUI the sandbox is discarded anyway after every PDF render, to reduce the likelihood of problems caused by state leaking over to the next render, but I might be misunderstanding.
Mon, Apr 8
@WMDE-leszek Hi, sorry for not answering sooner; the last few weeks have been crazy indeed.
Stalling until we have some sane solution.
Sat, Apr 6
Fri, Apr 5
Nice! Thanks so much!
Thu, Apr 4
I can say I haven't in a pretty long time. If @faidon also hasn't, I think we can shut it down.
Currently the /etc/apt/sources.list files for the pbuilder base images are missing entries for the security suites. These files should be updated and managed by puppet.
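For illustration, the kind of entry that is missing; the exact suite depends on the distribution the base image targets (`stretch` here is just an example, and the mirror URL may differ in our setup):

```
deb http://security.debian.org/debian-security stretch/updates main
deb-src http://security.debian.org/debian-security stretch/updates main
```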
Wed, Apr 3
Just some notes, I agree on premise with most of the above.
Tue, Apr 2
@bd808, I've submitted https://gerrit.wikimedia.org/r/500681 for review and updated https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497866 to match.
One minor correction
Mon, Apr 1
Now that I've automated the tests, it was easy to combine the 2 approaches and fix the drawbacks. User5 in P8321 is subject to both approaches, and the only drawback I could find is a race of 1s between an admin resetting the password (by mistake, or following some old and outdated process) and the user managing to authenticate to a service. This is an attack vector that requires 2 humans to perform an action, and thus not something that can be automated. I think we are safe from this one.
In the interest of fully figuring this out, I've updated my testing openldap vagrant env with a Makefile that automates and runs some tests. I've pasted the results in P8321. The test suite can (and maybe should) be extended to make sure we have all our bases covered.
Sat, Mar 30
Thu, Mar 28
@Gilles, I am a bit unclear as to what remains to be done for this. Could you shed some light?
Copying from https://gerrit.wikimedia.org/r/497684 (and adding some extra stuff)
Wed, Mar 27
Tue, Mar 26
Resolving, feel free to reopen
Mar 21 2019
Incident report updated. The only actionable followup task is T122676
Mar 20 2019
Mar 19 2019
Well, I have my reservations for sure. As I said, we are talking about a service-checker run every 10s (tunable, but it's a sensible default). While tunable, the command should not take long; the timeout is also tunable, but the default is 1s. I think that's a sensible value for a web service, and I don't think service-checker hitting all the endpoints plus a POST that ends up in kafka is going to be that fast. It also requires having service-checker on every node out there. While that's fine in production, the development environments definitely aren't gonna have it. We could have the probe in values.yaml (we do for all other services; only eventgate is an exception) and override it in production, and that's probably fine, but it should be documented.
The readiness probe can't really be a POST. The ref is here: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#probe-v1-core; it only allows httpGet, tcpSocket and exec. Exec could be used for this to call service-checker, but that adds a dependency on having service-checker on the nodes (it isn't there currently) and instrumentation (getting the pod IP essentially; I am not sure it's exposed there, will need to check) to make it happen. It might also be a tad heavy to run service-checker every 10 secs for every pod.
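The two shapes being compared can be sketched as below; the path, port, and the service-checker invocation are illustrative assumptions, not taken from the actual chart:

```yaml
# Hypothetical fragments. The v1 Probe API only allows httpGet, tcpSocket
# and exec handlers -- no POST.
---
# Option 1: plain HTTP GET readiness probe (possible today)
readinessProbe:
  httpGet:
    path: /_info        # illustrative endpoint
    port: 8080          # illustrative port
  periodSeconds: 10
  timeoutSeconds: 1
---
# Option 2: exec handler calling service-checker. Requires service-checker
# in the image, plus the pod's own IP -- e.g. injected via the downward API:
#   env:
#     - name: POD_IP
#       valueFrom:
#         fieldRef:
#           fieldPath: status.podIP
readinessProbe:
  exec:
    # binary name/path is an assumption; $(POD_IP) is expanded from the
    # container env by Kubernetes
    command: ["service-checker-swagger", "$(POD_IP)", "http://localhost:8080"]
  periodSeconds: 10
```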
Mar 17 2019
I think the intention is to (somewhat?) limit the impact the vandal is trying to achieve (at least by removing the capability to link to those comments). While, as you point out, the email notifications make it impossible to fully mitigate, and it potentially causes other issues, I still consider it a prudent course of action. Other than that, please wait for a formal announcement (a link to which will be posted to this task as well).