Wed, May 12
Calico components are running with resource definitions in all clusters now.
Tue, May 11
Merged and deployed.
Mon, May 10
I've looked into the typha and kube-controllers components as well, as they show similar patterns (different magnitude, though).
Unfortunately we lack Prometheus metrics for kube-controllers (they are not available in calico 3.17). Typha's throttling seems mostly related to Go GC and a bunch of goroutines pinging each other regularly. I'd assume the same for kube-controllers, but I can't be sure currently.
Fri, May 7
I tried to verify the above assumption by collecting metrics more frequently (once per second) from the Docker API (see P15857). This paints a much clearer picture of what actually happens.
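For reference, a minimal sketch of this kind of per-second collection, assuming the docker Python SDK and access to the local Docker socket; the container name is just an example, the actual script is in P15857:

```python
#!/usr/bin/env python3
"""Per-second CPU/throttling stats for one container via the Docker API.

Illustrative sketch only -- the real collection script is in P15857.
"""
import docker

client = docker.from_env()
container = client.containers.get("calico-typha")  # example container name

prev_cpu = prev_sys = None
# stats(stream=True) yields roughly one sample per second
for sample in container.stats(stream=True, decode=True):
    cpu = sample["cpu_stats"]["cpu_usage"]["total_usage"]
    system = sample["cpu_stats"]["system_cpu_usage"]
    throttling = sample["cpu_stats"]["throttling_data"]
    if prev_cpu is not None:
        # container CPU time as a share of total host CPU time in this window
        share = (cpu - prev_cpu) / (system - prev_sys) * 100
        print(f"cpu={share:.1f}% periods={throttling['periods']} "
              f"throttled_periods={throttling['throttled_periods']}")
    prev_cpu, prev_sys = cpu, system
```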
Thu, May 6
Tue, May 4
Mon, May 3
Fri, Apr 30
Thu, Apr 29
Wed, Apr 28
DNS SRV records and the pybal and confd instances in codfw, eqsin and ulsfo have been moved to the new cluster. navtiming.service on webperf needed a restart as well.
Tue, Apr 27
Zookeeper has completely moved from conf200[1-3] to conf200[4-6]; kafka-main, mirror-maker and kafka-logging in codfw have been restarted to catch up with that as well.
Mon, Apr 26
Switched zookeeper from conf2001 to conf2004.
We decided to leave it like this for today and see if anything comes up.
Fri, Apr 23
During testing today we had some side issues because calico-node was dying (we had brought down the network interface on one k8s node for testing). This led to two action items/nice-to-haves:
- Mark nodes as NotReady when critical DaemonSets (like calico-node) are not ready
- Unfortunately this is not something we can do with built-in methods. The discussions about this (Node Readiness Gates) seem to always end with the recommendation to start all nodes with a specific taint that is then removed by the mandatory DaemonSet or an additional controller (like https://github.com/wish/nodetaint); see the sketch after this list.
- Run calico-node without exponential backoff on crash loops. This removes potentially long wait times until a node comes back up when calico-node has failed a couple of times (the exponential backoff time is quite high).
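To make the taint-based workaround from the first item more concrete, here is a rough sketch of what such an additional controller could look like, using the kubernetes Python client. The taint key is purely hypothetical and this is not something we run; nodetaint (linked above) does the same job properly:

```python
#!/usr/bin/env python3
"""Toy controller: remove a startup taint once calico-node is ready on a node.

Sketch of the nodetaint-style approach; taint key and namespace are assumptions.
"""
from kubernetes import client, config

TAINT_KEY = "example.org/startup-taint"  # hypothetical taint set on node start

config.load_incluster_config()
v1 = client.CoreV1Api()


def calico_ready_on(node_name: str) -> bool:
    """True if a ready calico-node container is running on the given node."""
    pods = v1.list_namespaced_pod(
        "kube-system",
        label_selector="k8s-app=calico-node",
        field_selector=f"spec.nodeName={node_name}",
    )
    return any(
        status.ready
        for pod in pods.items
        for status in (pod.status.container_statuses or [])
    )


for node in v1.list_node().items:
    taints = node.spec.taints or []
    if any(t.key == TAINT_KEY for t in taints) and calico_ready_on(node.metadata.name):
        # Patch the node with the startup taint filtered out
        remaining = [
            {"key": t.key, "value": t.value, "effect": t.effect}
            for t in taints
            if t.key != TAINT_KEY
        ]
        v1.patch_node(node.metadata.name, {"spec": {"taints": remaining}})
```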
Apr 16 2021
The tlsproxy currently serves a certificate not valid for conf200[4,5,6] (Prometheus errors with: Get https://conf2004:4001/metrics: x509: certificate is valid for conf2001.codfw.wmnet, conf2002.codfw.wmnet, conf2003.codfw.wmnet, conf2001, conf2002, conf2003, etcd.codfw.wmnet, not conf2004)
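A quick way to double-check which names the tlsproxy actually serves (same information as in the Prometheus error above, or an openssl s_client call); assumes a reasonably recent cryptography package is available:

```python
#!/usr/bin/env python3
"""Print subject and SANs of the certificate served on conf2004:4001."""
import ssl

from cryptography import x509
from cryptography.x509.oid import ExtensionOID

pem = ssl.get_server_certificate(("conf2004.codfw.wmnet", 4001))
cert = x509.load_pem_x509_certificate(pem.encode())
san = cert.extensions.get_extension_for_oid(
    ExtensionOID.SUBJECT_ALTERNATIVE_NAME
).value

print("subject:", cert.subject.rfc4514_string())
print("SANs:", san.get_values_for_type(x509.DNSName))
```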
etcd cluster is set up now on conf200[4,5,6] although I had some trouble setting it up and I do not yet know why:
Apr 15 2021
Apr 14 2021
This is not exactly looking great on the staging clusters, as we can see heavy throttling. The current assumption is that this is caused by the very spiky nature of the work done by these processes and that it is not properly reflected in the Prometheus metrics: with a 60s scrape interval we most likely "miss" the spikes (a process that bursts for a second or two and idles for the rest of the minute averages out to a few percent of CPU, even though each burst can exceed the CFS quota and get throttled).
I do see that using the configmap election method is appealing, as it is built in and does not require additional software to function. Unfortunately I was not able to tell (from briefly reading the docs) whether this uses a separate configmap or the one that is actually used for configuring flink.
While the former would be okay-ish I guess, the latter will potentially cause problems, as every deployment will result in a re-creation of said configmap by helm, resetting it to whatever state the chart has defined.
Apart from potentially losing data in that case, I'm not 100% certain that helm will handle that properly in every case, as I have seen too many weird issues with helm and "manually" altered kubernetes objects.
What I think happened here is:
- The CronJob (with the old image set in its spec) created a Job "linkrecommendation-production-load-datasets-1618311600"
- That Job created a Pod (linkrecommendation-production-load-datasets-1618311600-hn6k8) - with the old image, of course
- That Pod was deleted by @Dzahn
- The Job (watching over the Pod) could not find the Pod and scheduled a new one
- The CronJob, on the other hand, would not schedule a new Job, as concurrency is 1 and there already was an "active" Job. A possible cleanup is sketched below.
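If that theory holds, the way out is to delete the stale Job itself (not just the Pod) so the CronJob can schedule a fresh one with the new image. Roughly, with the kubernetes Python client (the namespace is an assumption; kubectl delete job would do the same):

```python
#!/usr/bin/env python3
"""Delete the stale Job so the CronJob schedules a fresh one with the new image.

Equivalent to a kubectl delete job; the namespace here is an assumption.
"""
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

# Foreground propagation also removes the Pods owned by the Job
batch.delete_namespaced_job(
    name="linkrecommendation-production-load-datasets-1618311600",
    namespace="linkrecommendation",  # assumed namespace
    propagation_policy="Foreground",
)
```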
Apr 13 2021
== Initializing ==
Traceback (most recent call last):
  File "load-datasets.py", line 370, in <module>
    main()
  File "load-datasets.py", line 170, in main
    dataset_name_for_table="checksum", connection=mysql_connection
  File "load-datasets.py", line 129, in ensure_table_exists
    create_tables(raw_args=table_args, mysql_connection=connection)
  File "/srv/app/create_tables.py", line 46, in create_tables
    cursor.execute(database_utf_alter_query)
  File "/opt/lib/python/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/opt/lib/python/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/opt/lib/python/site-packages/MySQLdb/connections.py", line 259, in query
    _mysql.connection.query(self, query)
MySQLdb._exceptions.OperationalError: (1044, "Access denied for user 'adminlinkrecommendation'@'10.64.0.135' to database 'mwaddlink'")
[general] Ensuring checksum table exists...
Apr 8 2021
We should take the opportunity and refactor this a bit.
According to kube-apiserver -h we don't need to list the default admission controllers via --enable-admission-plugins anymore and, even worse, they won't get disabled when left out. From the help output:
Added some defaults based on the current maximum values (https://grafana-rw.wikimedia.org/d/2AfU0X_Mz/jayme-calico-resources?orgId=1)
This is probably something we can revisit when we've decided how we deal with multiple k8s versions in the future (T278329)
Users and the template have been migrated to use the username as the user ID and a YAML list of groups instead of "type".
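Regarding the resource defaults added above from the current maximum values: as an illustration, peaks like these can also be pulled straight from Prometheus. The endpoint, time ranges and label matchers below are assumptions (the real numbers came from the linked Grafana dashboard):

```python
#!/usr/bin/env python3
"""Pull peak CPU/memory usage of calico-node containers from Prometheus.

Illustration only; endpoint, ranges and label matchers are assumptions.
"""
import requests

PROM = "http://prometheus.example.org/api/v1/query"  # hypothetical endpoint

QUERIES = {
    "memory_max_bytes": 'max_over_time(container_memory_working_set_bytes{container="calico-node"}[7d])',
    # subquery: 5m rate, evaluated every 5m, maximum over the last 7 days
    "cpu_max_cores": 'max_over_time(rate(container_cpu_usage_seconds_total{container="calico-node"}[5m])[7d:5m])',
}

for name, query in QUERIES.items():
    result = requests.get(PROM, params={"query": query}, timeout=30).json()
    for series in result["data"]["result"]:
        print(name, series["metric"].get("instance", ""), series["value"][1])
```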
Apr 7 2021
Apr 6 2021
Removing serviceops as we don't use packages/releases signed with that key
Mar 24 2021
The first two repos were already read-only with an archived description. I've done the same for the third one as well.
It's safe to say we did this, and we have tasks for the follow-ups (mostly from T277191).
Mar 23 2021
It's still an open question how we will inject the node IP into the mcrouter configuration. It would probably mean passing the host IP as an environment variable to the mcrouter container and somehow injecting it into the configuration (see the sketch below). The downside is that we wouldn't be able to make use of mcrouter's ability to reload its configuration at runtime.
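For the record, one possible shape of this (purely a sketch, not what the chart does): expose the node IP to the container via the downward API (status.hostIP as e.g. HOST_IP) and have a small entrypoint render it into the config before starting mcrouter. Placeholder name, paths and env variable are all hypothetical; this is exactly the variant that gives up runtime config reloads.

```python
#!/usr/bin/env python3
"""Hypothetical mcrouter entrypoint: render the node IP into the config at startup.

Assumes HOST_IP is injected via the downward API (fieldRef: status.hostIP) and
that the config template contains a __NODE_IP__ placeholder -- both illustrative.
"""
import os
import sys

host_ip = os.environ["HOST_IP"]

with open("/etc/mcrouter/config.json.tpl") as src:
    rendered = src.read().replace("__NODE_IP__", host_ip)

with open("/etc/mcrouter/config.json", "w") as dst:
    dst.write(rendered)

# Replace this process with mcrouter so signal handling stays normal
os.execvp(
    "mcrouter",
    ["mcrouter", "--config", "file:/etc/mcrouter/config.json"] + sys.argv[1:],
)
```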
But wait, it's currently still not fully active and blocked by: T274262
This is active in all clusters now
Done and rolled out
All nodes running kernel 4.19 now
The goal is to pack 4 or even 5 pods in a single modern node.