Fri, Feb 14
Thu, Feb 13
The VM had kvm_extra: -bios OVMF.fd in its configuration. That meant it used UEFI, not BIOS, and hence the usual boot_order: network ganeti functionality wouldn't work, as the boot order is stored in the UEFI "firmware", which ganeti doesn't support yet. The fix was
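For context, this is roughly the mechanism that UEFI bypasses: on a BIOS-booting ganeti instance the boot order is just a KVM hypervisor parameter (instance name below is a placeholder, this is only a sketch):

    # Normally boot order is an hvparam on the instance; this has no effect
    # once -bios OVMF.fd is set, since UEFI keeps its own boot order in NVRAM.
    gnt-instance modify -H boot_order=network some-vm.example.wmnet
    gnt-instance reboot some-vm.example.wmnet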
Note that we currently have such an alert (or at least something close to it).
Wed, Feb 12
I've just conducted 2 separate tests on 2 selected mw hosts, one appserver and one apiserver. Those were mw1331 and mw1348 respectively. The tests were (via mangling /etc/hosts)
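For illustration only (not the actual entries used in the tests), "mangling /etc/hosts" means overriding name resolution locally, along these lines with placeholder names and addresses:

    # Point a service hostname at a specific backend IP on the test host.
    # Hostname and IP are placeholders, not the real test values.
    echo '10.2.2.1 example.svc.eqiad.wmnet' | sudo tee -a /etc/hosts
    getent hosts example.svc.eqiad.wmnet   # verify the override is picked up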
Seems like the deploy did not fix it after all. Most (if not all) hosts alerted this morning. It's evident in the graphs as well
Tue, Feb 11
Updated the list of actions that have to be taken in the task description. Item #1 is done, we are now working on the helm chart. As soon as the chart is reviewed and merged, the other 2 items are SRE deploys (and should be done fairly quickly). Defining what a "safe" deploy is for changeprop is a pretty good question. I guess @Pchelolo might be able to help with that, there's already some discussion in T244387 about it. Last step is pretty obvious I think :-)
Adding @hnowlan so that he is aware and can perhaps help move this along.
The result of running the command below is at:
Mon, Feb 10
Adding security team as an FYI
Fri, Feb 7
There's nothing rushing us on this btw, feel free to propose alternative maint windows.
Those 4 machines will have to be done one by one, in order, as @RobH points out. Overall, about an hour of advance notice should suffice, but let's do one each day? I'll add tentative maint windows (lasting 1 day each, for your convenience) to the task
The capacity increase did not fix anything, nor did some further efforts to increase requests/limits. In fact the sum of throttled time increased by 50%, which lends more weight to the hypothesis about CFS quota issues. The TL;DR is that all pods, regardless of the amount of work they do, get mildly throttled because the Linux CFS scheduler accounts for every chunk of time allocated to a task, even if the task has yielded.
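For reference, a minimal way to see that throttling directly from the cgroup accounting on a node (the path below is a placeholder and depends on the cgroup driver; Prometheus exposes the same counters as the container_cpu_cfs_throttled_* metrics):

    # CPU accounting for one container's cgroup; nr_throttled and throttled_time
    # keep growing even when usage stays well under the limit.
    cat /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod<UID>/<container-id>/cpu.stat
    # nr_periods 123456
    # nr_throttled 7890
    # throttled_time 4567890123   (nanoseconds)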
Let me add my own finding. Doing systemctl stop uwsgi-ores triggers the issue. It's during the stop phase that uwsgi workers go haywire on CPU and memory usage. systemctl start uwsgi-ores after that does not cause any significant CPU or memory increase.
We are definitely better than we used to be, but I am still not happy. I'll increase the capacity as well, from 4 pods to 6 pods, that is, by 50%.
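The equivalent of that capacity bump expressed as a plain kubectl command (deployment and namespace names are placeholders; in practice the change goes through the service's helm chart values rather than kubectl):

    # Bump replicas from 4 to 6; illustrative only.
    kubectl -n <namespace> scale deployment <service> --replicas=6
    kubectl -n <namespace> get pods   # confirm 6 pods come up Ready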
Limits have been increased to 2.5 cores. However the app is still mildly throttled. Given the limit is 1.5 times the current total usage, I am inclined to think this is a scheduler artifact. We've seen it before with kask and there's a lot of talk about it. It's essentially a recap of linux kernel commit 512ac999. Interestingly, after the deploy, latencies dropped by some 25ms
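For the record, this is roughly what a 2.5-core limit corresponds to at the kubernetes level (names are placeholders; the actual value is set in the helm chart, not via kubectl):

    # 2.5 cores expressed as a CPU limit; illustrative only.
    kubectl -n <namespace> set resources deployment <service> --limits=cpu=2500m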
https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=1581018813182&to=1581025628155&var-dc=eqiad%20prometheus%2Fk8s&var-service=wikifeeds is a graph of wikifeeds during the outage yesterday. The CPU throttling is very aggressive there, meaning the service did not have adequate resources to serve the requests in time. That ended up depooling the pods one by one until none were left to serve the load. That triggered the obvious alerts, upon which we investigated and resolved the issue by restarting all pods, as they were probably not salvageable in any decent amount of time. In fact, judging by the output of kubectl get pods, some were occasionally repooled, only to be flooded with requests once more, quickly rendered unable to serve more traffic and depooled again, leading to a self-sustaining downward spiral out of which it was difficult to get.
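"Restarting all pods" amounts to something like the following sketch (the namespace is an assumption; deleting the pods makes the deployment recreate them from scratch):

    # Delete all wikifeeds pods so the deployment recreates them fresh.
    kubectl -n wikifeeds delete pod -l app=wikifeeds
    kubectl -n wikifeeds get pods -l app=wikifeeds -w   # watch replacements come up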
Thu, Feb 6
I've tried to reproduce this. It's easily reproducible after all. Just do what logrotate does and issue systemctl reload uwsgi-ores. CPU usage spikes and reaches 100% for all CPUs on the machine for several seconds. Memory usage spikes as well and then the OOM killer shows up, as the machine is out of memory. The best candidate for the OOM killer to kill is celery, as this is the big memory user.
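A minimal reproduction recipe along those lines; the observation commands are my own suggestions, not necessarily what was run:

    # Trigger the same reload that logrotate issues (on an affected host).
    sudo systemctl reload uwsgi-ores
    # In another terminal: watch CPU/memory of the uwsgi workers spike.
    top -b -n 1 | head -30
    # Check whether the OOM killer fired and what it picked (typically celery).
    sudo dmesg -T | grep -i 'killed process'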
Wed, Feb 5
I just reverted the cr3, cr4 ulsfo change.
I blocked a number of IPs manually on cr3 and cr4 for ulsfo. Command was
Important release notes for 1.13.x that affect us
T243451 does explain the higher memory usage. It even points out that the higher memory usage is worrisome; however, it was deployed anyway.
Tue, Feb 4
eqiad and codfw graphs both point to a similar issue as last time. 06:25 UTC seems to be the start of the incident. Memory usage, however, had increased by close to 100% 9 hours before the event. The trigger is probably logrotate again (it runs every day anyway, so if it were the actual cause we would see this all the time), but the underlying cause is probably something in the traffic patterns.
Fri, Jan 24
@Dzahn, I've merged the required remaining changes to get the migration done. Now etherpad.wikimedia.org uses etherpad1002. Checked a couple of pads; everything seems fine. Hopefully we have no corruption issues. etherpad1001 is now removed from site.pp and I've removed the etherpad-lite debian package from it. I've also -2ed the discovery record changes due to the issue above about the software not supporting scaling out. I guess what's left is to decommission and delete that VM.
Pads that, per the logs, have been accessed on https://etherpad-new.wikimedia.org
I've removed the DNS record and stopped and masked the service on etherpad1002 for now. Since we proved it works, let's just move over to etherpad1002.eqiad.wmnet, stopping etherpad1001 beforehand (to avoid the issues I alluded to). Etherpad is best effort anyway, it's OK to even have an extended downtime.
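For reference, stopping and masking the service looks like this; the unit name etherpad-lite is an assumption based on the debian package name:

    # On etherpad1002: stop the service and mask it so nothing restarts it by accident.
    sudo systemctl stop etherpad-lite
    sudo systemctl mask etherpad-lite
    systemctl status etherpad-lite --no-pager   # should show masked and inactive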
Thu, Jan 23
Thanks. That's working now, but I've downloaded the log file and it's just what's already available on kibana, warn level or higher. There's no debug level or message (10/20) in the logs - I don't suppose we have those anywhere?
-l app=citoid, as that's the value of the app label, not citoid-production.
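That is, something along these lines (the namespace is an assumption):

    # Select citoid pods by their app label.
    kubectl -n citoid get pods -l app=citoid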
However, I just noticed that for some reason setting DEBUG_LEVEL: 0 for zotero no longer works.
it should be in the raw logs
Wed, Jan 22
Tue, Jan 21
Thanks for the ping. Notes:
But first, I think a big source of confusion in our patches is the overloaded use of the word 'service'.
I reran the num_workers=3 test 2 times. No big difference. 100 "locust users", spawned at a rate of 0.1/s. After peaking at about 0.5 RPS, errors start happening and latency skyrockets at ~60s. CPU is still around 3K. Memory-wise it has gone up to 4GB after about 1.5h of benchmarking. The funny thing is that memory usage is not plateauing at all, but rather keeps on increasing. This is, I guess, expected given we use chromium, which is known for being a memory hog. Kubernetes will take care of the memory leaks anyway, by restarting the pod if it goes over its limit.
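For reproducibility, the benchmark invocation would have been roughly of this shape; the locustfile, target host and exact flag names are assumptions (matching recent locust releases):

    # Headless locust run: 100 simulated users, spawned at 0.1 users/second.
    locust --headless -f locustfile.py --host https://<service-endpoint> -u 100 -r 0.1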
We resolved this live in a hangout with @Mvolz. Re-resolving
Mon, Jan 20
I've rerun the benchmark against values of 1, 2 and 3 for num_workers