User Details
- User Since: Jan 5, 2016, 9:54 PM
- MediaWiki User: LToscano (WMF)
Yesterday
The test in staging was successful! All the revscoring services are now running without WIKI_URL set explicitly.
Current status:
Wed, Apr 24
Tue, Apr 23
The only issue I see from Puppet is that prometheus::node_amd_rocm uses rocm-smi to discover which GPUs to monitor.
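For context, here is a minimal sketch of how an exporter could discover GPUs from rocm-smi output. This is not the actual prometheus::node_amd_rocm implementation, and the `--showid --json` flags and JSON shape shown are assumptions about rocm-smi's output:

```python
import json
import subprocess


def discover_gpus(smi_output: str) -> list:
    """Extract GPU card names from (assumed) rocm-smi JSON output.

    The output is assumed to be a JSON object keyed by "card0",
    "card1", ... with per-card details as values.
    """
    data = json.loads(smi_output)
    return sorted(k for k in data if k.startswith("card"))


def rocm_smi_cards() -> list:
    # Runs the real binary; only works on a host with ROCm installed.
    out = subprocess.run(
        ["rocm-smi", "--showid", "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return discover_gpus(out)
```

The point is that the exporter hard-depends on the rocm-smi binary being present, which is what makes the Puppet profile ROCm-specific.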
Quick clarification - there are currently two places where we use ROCm-specific libs:
Also, I confirm that AQS Cassandra now runs with PKI TLS certs, so we can start encrypting client connections anytime.
Filed a change for the stat nodes, the hadoop worker nodes already have the truststore!
Fri, Apr 19
After some tests I was able to find an istio configuration to support both transparent and non-transparent proxy settings (namely, setting WIKI_URL or similar to a discovery endpoint or just using http://en.wikipedia.org).
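To illustrate the difference between the two modes, here is a sketch (stdlib only, not our actual service code; the discovery URL is an assumption) of what the client-side request looks like in each case:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed internal discovery endpoint; the real name may differ.
DISCOVERY_ENDPOINT = "https://api-ro.discovery.wmnet"


def build_request(rev_id: int, transparent: bool) -> Request:
    """Build (without sending) the MediaWiki API call in the two proxy modes."""
    params = urlencode({"action": "query", "revids": rev_id, "format": "json"})
    if transparent:
        # Transparent mode: the service just uses the public URL
        # (WIKI_URL=http://en.wikipedia.org) and the mesh intercepts
        # the traffic, routing it to the internal endpoint.
        return Request(f"http://en.wikipedia.org/w/api.php?{params}")
    # Non-transparent mode: WIKI_URL points at the discovery endpoint,
    # and the Host header tells the backend which wiki we want.
    return Request(
        f"{DISCOVERY_ENDPOINT}/w/api.php?{params}",
        headers={"Host": "en.wikipedia.org"},
    )
```

Supporting both shapes in one istio configuration means neither style of deployment needs special-casing.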
Next steps:
- Roll out PKI to aqs eqiad (codfw is already done and running fine)
- Roll out the new truststore to all RESTBase nodes (prep step for deploying PKI).
@JAllemandou Hi! I have a question for you when you have a moment :)
Thu, Apr 18
Wed, Apr 17
I've noticed some autoscaling and high CPU usage in the kserve containers. I've raised the min/max replicas from 1/4 to 4/6, and with the extra capacity the latency is much better (at least for now).
We are still seeing high latency for ruwiki's damaging model. The current theory is that some rev-ids cause trouble in the preprocessing (feature extraction, etc.) phase, leaving the server in an inconsistent state that affects other requests as well. We'll deploy https://gerrit.wikimedia.org/r/1020898 as a stop-gap to figure out which requests are causing the issue, and then we'll try to find a fix. In the meantime users may experience trouble when calling Lift Wing; apologies in advance, and please keep reporting connectivity issues if you find them.
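As a sketch of what such a stop-gap could look like (this is illustrative, not the content of the gerrit change; the threshold and function names are hypothetical), a timing wrapper that flags slow rev-ids is enough to identify the problematic requests:

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("preprocess-timing")
SLOW_THRESHOLD_S = 5.0  # hypothetical: anything above this is worth logging


def log_slow_revisions(func):
    """Log the rev_id of any request whose preprocessing is suspiciously slow."""
    @wraps(func)
    def wrapper(rev_id, *args, **kwargs):
        start = time.monotonic()
        try:
            return func(rev_id, *args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > SLOW_THRESHOLD_S:
                logger.warning("slow preprocess: rev_id=%s took %.1fs", rev_id, elapsed)
    return wrapper
```

Once the offending rev-ids are in the logs, they can be replayed in isolation to reproduce the inconsistent state.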
Saved ruwiki's pod logs to deploy1002:/home/elukey/T362503
Overall steps:
Today, after a lot of debugging in staging for T362316, I found https://github.com/istio/istio/issues/21914. The main issue I was trying to solve: after applying the new ServiceEntry/DestinationRule/VirtualService config for T362316 everything worked like a charm, but it suddenly stopped working if https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984214 was reverted (manually deleting the extra ServiceEntry).
Added some thoughts to T353622#9723070. I opened a big can of worms while testing in staging :) The upgrade is more complex than anticipated, but we should be able to do it this week or next at the latest.
There are two kinds of istio metrics: the ones from the gateway and the ones from the sidecars (inbound and outbound). In theory it should be sufficient to keep the gateway metrics, since if a sidecar misbehaves it should be clearly visible there, and it would reduce the volume of metrics pulled even further. The gateway metrics can be distinguished from the rest via the kubernetes_namespace="istio-system" label.
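As a sketch, dropping the sidecar series at scrape time could be done with a metric relabel rule that keeps only the gateway's namespace (the label name matches the one above; where exactly this lands in our scrape config is an assumption):

```yaml
metric_relabel_configs:
  # Keep only the istio gateway's series; sidecar (inbound/outbound)
  # metrics from other namespaces are dropped at scrape time.
  - source_labels: [kubernetes_namespace]
    regex: istio-system
    action: keep
```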
Tue, Apr 16
Created two follow ups:
@Eevans sigh :( I found https://github.com/gocql/gocql/issues/1611 that may help, I didn't have time to check the code though.
Mon, Apr 15
I think that multiple requests caused a ton of time to be spent in preprocess(), causing the isvc to stall completely and get into a weird state (most probably revscoring ended up in a broken, non-working state).
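A minimal sketch (not the actual kserve/revscoring code) of why CPU-heavy preprocessing can stall an async model server: if `preprocess` runs directly on the event loop, every other in-flight request waits behind it. Offloading to a worker thread is the usual mitigation:

```python
import asyncio
import time


def heavy_preprocess(rev_id: int) -> dict:
    # Stand-in for revscoring feature extraction: pure CPU work that
    # blocks whatever thread it runs on.
    time.sleep(0.05)
    return {"rev_id": rev_id, "features": []}


async def predict(rev_id: int) -> dict:
    # Calling heavy_preprocess() directly here would block the event
    # loop and stall every other in-flight request, which matches the
    # symptom above. to_thread keeps the loop responsive.
    return await asyncio.to_thread(heavy_preprocess, rev_id)


async def main() -> list:
    # Several concurrent requests, as the gateway would generate.
    return await asyncio.gather(*(predict(r) for r in (1, 2, 3)))
```

With enough slow rev-ids arriving together, even the threaded variant can exhaust its workers, which would look exactly like the stall described above.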
Something strange: from the istio gateway logs, the HTTP response code logged is 0 https://logstash.wikimedia.org/goto/9003f0bd1a3c34e303ac5fbe86eff693
Fri, Apr 12
Current status:
Thu, Apr 11
aqs1010's instances are running with PKI TLS certs; so far everything looks good. I had a chat with Eric and Ben: we'll let it run until next week to catch issues, then proceed with the rest of the cluster.
I'm not familiar with the internals, but the procedure seems sound! In the ML case we could live without backfilling the previous quarters' SLO time series; the more pressing thing is to start from a clean state (without gaps, etc.). Thanks for the work!
Tue, Apr 9
Thanks to Aiko, who fixed some issues with RR Wikidata and ML, the new code is now deployed to all the model servers that used to have OMP_NUM_THREADS explicitly set in deployment-charts. The model servers work fine and their performance is good.
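For background, a sketch of the pattern that replaces the hardcoded chart values (illustrative, not the actual model-server code; `cpu_limit` is a hypothetical value taken from the pod's resource limits):

```python
import os


def set_omp_threads(cpu_limit=None) -> int:
    """Pin OpenMP threads to the container's CPU allocation.

    Must run before importing numpy/torch, since OpenMP reads the
    variable at library load time. Without this, OpenMP defaults to
    the host's full core count and oversubscribes a limited pod.
    """
    n = cpu_limit or os.cpu_count() or 1
    os.environ["OMP_NUM_THREADS"] = str(n)
    return n
```

Computing the value in code means the deployment-charts no longer need to carry a per-service OMP_NUM_THREADS override.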
I opened T362181 for the Airflow/Spark clients; IIUC they don't currently use TLS, so we should be good.
Mon, Apr 8
@MatthewVernon we could do something like the following:
Fri, Apr 5
The last step is to review/merge the puppet changes listed above, then we can close!
Created new /116 for AUX: https://netbox.wikimedia.org/ipam/prefixes/930/
I tried to check the Cassandra AQS clients and how they trust/validate TLS certificates. IIUC all the clients are on k8s and use the cassandra-http-gateway chart, which renders a config file (/etc/cassandra-http-gateway/config.yaml) containing various info about how to connect to a Cassandra cluster; among other things I found:
Added a change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013571/) to be able to keep the ca-manager's ca-bundle around, since currently it disappears from the puppet catalog (so the file stays on the node, but if we reimage it will not be re-created, etc.).
The new option is meant only as a compromise to fully transition us to cfssl without too much pain :)
I have updated the docs for the renewal use case; I don't think we need to change anything in the cert's manifest for it.
Thu, Apr 4
Rolled out Dragonfly to all ml clusters!
All subtasks completed, wrapping up the task, thanks to all for feedback/help/support! <3
We have created two base images, one for PyTorch 2.2.x and one for 2.1.x; they will be tested and used with the Revert Risk ML and Hugging Face model servers.
Hi Jeff!
Wed, Apr 3
The old (cergen-based) certificates for Kafka Logging in deployment-prep expired, and we worked to replace them with a PKI-based TLS cert (managed via puppet, no manual or periodic renewal needed). Most of the puppet work was about making PKI work in deployment-prep, since several issues were causing failures. The puppet master rebuild caused some trouble when trying to apply private commits in deployment-prep's private repo. While at it, the deployment-prep Kafka jumbo cluster was moved to PKI as well, to fix a TLS cert expiry issue with varnishkafka running on the deployment-prep cache node.
I think all the work is done; varnishkafka seems to be working fine now and the logs are flowing nicely. Shall we close?
Tue, Apr 2
Ok, so the issue was that the profile::pki::client::auth_key value set under hiera's profile settings was not picked up (thanks Taavi for the help), so I was trying the right config in the wrong place.
Update: after a lot of config changes, we are now able to add local commits to the private repo on deployment-puppetserver-1. I tried various hacks to override the value we set in profile::pki::client::auth_key, but it doesn't seem to be picked up.
I have seen the same behavior, namely pip trying to download torch's CPU version and ending up installing only nvidia-related packages. I like the explicit-dependency solution; it is less flexible than letting pip manage dependencies, but I think it is the only viable way to get a good result.
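A sketch of what the explicit-dependency approach could look like in a requirements file (the version pin is illustrative; the extra index URL is PyTorch's CPU wheel index, and a similar per-backend index exists for ROCm builds):

```
# requirements.txt sketch: pin torch explicitly and point pip at the
# CPU wheel index so it never falls back to the default CUDA build
# (which is what drags in the nvidia-* packages).
--extra-index-url https://download.pytorch.org/whl/cpu
torch==2.2.1+cpu
```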
I think that by default any puppetmaster that pulls data from another repository (i.e., is not the canonical source of truth) has this protection to avoid mistakes. I also see some actually-private/local tags for some commits in the git history, so we are probably missing some configs.
@Andrew Hi! I don't see any instance named like that in deployment-prep; IIRC we deleted it, it is not used anymore.
The ores_cache job should be defined but disabled in the running config; we don't use it anymore and IIRC it is no longer running in ChangeProp (lemme know otherwise).
The PKI intermediate cloud node is fixed. Now I think we need to fix the second biggest issue pointed out by Amir, namely that the auth token is probably misconfigured on a lot of cloud nodes.
Fri, Mar 29
Ok, the fact that the PKI node with the intermediate CA is broken is not great; let's try to fix that first.
The perms are the same for the wmf cacert bundle:
I did the following on logstash1036:
Hi folks! IIRC the permissions of the file (rw only by root) should be enough to prevent injection of new CA certificates, but I'll double-check and report back.
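A minimal sketch of the check in question (assuming "rw only by root" means owned by uid 0 with group/other write bits cleared; the function name is hypothetical):

```python
import os
import stat


def is_root_only_rw(path: str) -> bool:
    """Return True if the file is owned by root and writable only by root.

    Anyone able to read the bundle is fine; what matters for CA
    injection is that only root can modify it.
    """
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    return st.st_uid == 0 and not (mode & (stat.S_IWGRP | stat.S_IWOTH))
```

If this holds for the bundle, appending a rogue CA certificate would require root already, at which point file permissions are moot anyway.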
Thu, Mar 28
There is an obstacle with the current approach that I didn't think about. In the current setup, this happens:
@herron something really strange: https://w.wiki/9bMW