Double-checked across all nodes; this has been applied successfully. Resolving. Thanks!
show system1/log1 (etc.) has 2 telling entries
I'll lower the priority for this. We may have the solution. https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=87&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=ores&var-instance=All&from=now-7d&to=now lacks the pattern on the 30th and the 1st, whereas it is clearly visible on all the days before that. I'll leave it be for a couple more days though, so we can monitor.
Mon, Jun 29
My bet is that when you shut down the workers, the PyObject C structure of each Python object is 'touched', forcing the 'copy' part of the COW (copy-on-write) behaviour.
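For illustration, a minimal sketch of that effect (this is not the actual worker shutdown path; the fork() and the list of a million throwaway objects are just assumptions to make the behaviour visible):

```python
import os
import sys

# After fork(), parent and child share pages copy-on-write. In CPython,
# merely *reading* an object runs Py_INCREF/Py_DECREF on ob_refcnt in the
# shared PyObject header, so the kernel has to copy the page it lives on.
data = [object() for _ in range(1_000_000)]  # spread objects over many pages

pid = os.fork()
if pid == 0:
    # Child: iterating is enough to touch every object's refcount,
    # dirtying (and therefore copying) nearly all of those pages.
    for obj in data:
        sys.getrefcount(obj)  # any access suffices; no "write" in Python terms
    os._exit(0)

os.waitpid(pid, 0)
# Watching the child's RSS (e.g. via /proc/<pid>/smaps) during the loop
# would show it climbing towards the parent's size.
```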
Fri, Jun 26
Everything seems fine. I'm gonna resolve this. @Pchelolo thanks again!
Both paths outlined in the description have been followed. We now have 4 dedicated sessionstore nodes in each DC, with a capacity of 6 kask pods each, for a total of 24 pods. That is 6 times the capacity we had when the incident was triggered and 3 times the capacity we currently have allocated. So we have room to react to a sudden increase, hopefully in the future even automatically.
VMs created, installed and ready. Resolving
Thu, Jun 25
It seems like the restbase deploy of 821e96b fixed the issue. I'll lower the priority but leave this open for a few hours; if everything is fine, I'll close it as Resolved.
@Jclark-ctr Excellent. I've started the process of emptying ganeti1006 (and filling ganeti1005); that should take quite a while, but we should be on time for next Thursday. Many thanks!
A couple more benefits of k8s I forgot to mention yesterday:
@Jclark-ctr: ganeti1005 is ready. Fully depooled, downtimed and powered off.
@Pchelolo I think that yesterday's deploy of restbase has caused the issue described in this task. Restbase also alerts with
The start of the increases coincides with a deploy of restbase
codfw wikifeeds is, by the way, exhibiting similar behavior. https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?panelId=12&fullscreen&orgId=1&from=now-24h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=wikifeeds
Turnilo points out that these requests come disproportionately with this User-Agent
Wed, Jun 24
An interesting thing to note here is that some services have their pods restarting quite often, e.g.
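For reference, one quick way to eyeball restart counts per pod is via the Kubernetes API; a rough sketch with the official Python client ("changeprop" is just a placeholder namespace, adjust as needed):

```python
from kubernetes import client, config

# List pods in a namespace and print per-container restart counts.
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("changeprop").items:
    for cs in pod.status.container_statuses or []:
        if cs.restart_count:
            print(f"{pod.metadata.name}/{cs.name}: {cs.restart_count} restarts")
```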
There are no oresrdb nodes at all anymore; ORES Redis has been moved to the redis::misc cluster. This is probably no longer relevant. I'll remove the vm-requests tag, feel free to re-add it.
Currently, sessionstore sets a limit of 400Mi and 2.5 CPUs. Memory-wise, the nodes have 4GB of RAM and 6 CPUs. The easy win here is to increase the number of vCPUs from 6 to 15. That should allow for a 200% increase in the number of pods the node can run (from 2 to 6). The amount of memory available already supports 10 pods in theory; in practice, since the node itself also consumes some memory, it's more like ~8, which brings the two resources pretty close to each other.
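A quick back-of-the-envelope check of those numbers (the ~0.8GB of node overhead below is an assumed figure for illustration, not a measured one):

```python
# Pods per node are bounded by whichever resource runs out first.
POD_CPU_LIMIT = 2.5            # CPUs per kask pod
POD_MEM_LIMIT = 400 / 1024     # 400Mi expressed in GB

for node_cpus in (6, 15):
    print(f"{node_cpus} vCPUs -> {int(node_cpus // POD_CPU_LIMIT)} pods (CPU-bound)")
# 6 vCPUs -> 2 pods, 15 vCPUs -> 6 pods

NODE_MEM = 4.0                 # GB of RAM per node
NODE_OVERHEAD = 0.8            # GB kept for the node itself (assumed value)
print(f"memory-bound: {int(NODE_MEM // POD_MEM_LIMIT)} pods in theory, "
      f"{int((NODE_MEM - NODE_OVERHEAD) // POD_MEM_LIMIT)} in practice")
# -> 10 pods in theory, ~8 in practice
```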
Tue, Jun 23
@Jclark-ctr: OK, how about 1 host per week? No need for specific timeframes. I'll have the host depooled, emptied, powered off, downtimed in Icinga and ready for the memory upgrade. All you'll need to do is add the memory and power it up.
With the last 2 changes merged, the work on this has been completed. Many thanks @apakhomov!
Mon, Jun 22
@JMeybohm @Pchelolo @hnowlan: This is an account of the investigation we went through for the changeprop memory/CPU limiting.
It's part of the sessionstore incident actionables, although it turned out to be largely unrelated after all. Please proofread and correct any errors I might have made.
Fri, Jun 19
For what it's worth, this is probably going to be scheduled for next quarter (so July-September).
Thu, Jun 18
Wed, Jun 17
Great! Thanks Papaul
Service owner steps done, DC ops work can begin.
Service ops owner steps done; the machines are ready to be handled by DC ops.
Tue, Jun 16
Mon, Jun 15
Sun, Jun 14
kubernetes2007 has been reimaged successfully; it seems like kubernetes2008 to kubernetes2014 require networking configuration on the switch side.
Fri, Jun 12
Yes, absolutely agreed. The trigger was indeed insufficient capacity in sessionstore to handle a sudden increase in requests of at least 33% (~15k to ~20k, if not more). We've gone ahead and added capacity to the service and will follow up by adding more capacity to the entire cluster as well as to the dedicated sessionstore nodes. So I'll be bold and resolve this; feel free to reopen though.