After some tests I was able to find an istio configuration that supports both transparent and non-transparent proxy settings (namely, setting WIKI_URL or similar either to a discovery endpoint or just to http://en.wikipedia.org).
Today
Next steps:
- Roll out PKI to aqs eqiad (codfw is already done and running fine).
- Roll out the new truststore to all Restbase nodes (a prep step for deploying PKI).
@JAllemandou Hi! I have a question for you when you have a moment :)
Yesterday
Wed, Apr 17
I've noticed some autoscaling and high CPU usage in the kserve containers. I've raised the min/max replicas from 1/4 to 4/6, and with more capacity the latency is much better (at least for now).
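For reference, this is roughly where those knobs live in a KServe InferenceService (a minimal sketch; the isvc name is made up, and in our case the values actually come from deployment-charts):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ruwiki-damaging        # hypothetical isvc name
spec:
  predictor:
    minReplicas: 4             # raised from 1
    maxReplicas: 6             # raised from 4
```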
We are still seeing high latency for ruwiki's damaging model. The current theory is that some rev-ids cause trouble in the preprocessing (feature extraction, etc.) phase, ending up in an inconsistent state that affects other requests as well. We'll deploy https://gerrit.wikimedia.org/r/1020898 as a stop-gap to figure out which requests are causing the issue, and then we'll try to find a fix. In the meantime users may experience trouble when calling Lift Wing; apologies in advance, but please keep reporting connectivity issues if you find them.
Saved ruwiki's pod logs to deploy1002:/home/elukey/T362503
Overall steps:
Today, after a lot of debugging in staging for T362316, I found https://github.com/istio/istio/issues/21914. The main issue I was trying to solve: after applying the new ServiceEntry/DestinationRule/VirtualService config for T362316, everything worked like a charm, but suddenly stopped working once https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984214 was reverted (manually deleting the extra ServiceEntry).
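For context, this is the general shape of a ServiceEntry like the one involved (a minimal sketch with illustrative name/host/ports, not the actual content of the patch):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: wikipedia-external     # hypothetical name
spec:
  hosts:
    - en.wikipedia.org         # illustrative external host
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
    - number: 443
      name: tls
      protocol: TLS
```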
Added some thoughts to T353622#9723070; I found a big can of worms while testing in staging :) The upgrade is more complex than anticipated, but we should be able to do it this week or next at the latest.
There are two kinds of istio metrics: the ones from the gateway and the ones from the sidecars (inbound and outbound). In theory it should be sufficient to check the gateway metrics, since if a sidecar misbehaves it should be clearly visible from them, and doing so would reduce the volume of metrics pulled even further. The gateway metrics should be distinguishable from the rest via the kubernetes_namespace="istio-system" label.
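For example, a recording rule that selects only the gateway series could look like this (a sketch; the rule group/name and aggregation are illustrative, istio_requests_total is the standard Istio request metric):

```yaml
groups:
  - name: istio-gateway-sli              # hypothetical rule group
    rules:
      - record: istio:gateway_requests:rate5m
        expr: |
          # only series emitted by the gateways in istio-system
          sum by (response_code) (
            rate(istio_requests_total{kubernetes_namespace="istio-system"}[5m])
          )
```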
Tue, Apr 16
In T352647#9718545, @Eevans wrote: I propose we carry on with the migration to PKI, accepting that Cassandra-based golang services will have to have verification disabled for now. It's not a regression, so I don't think we should let it hold up this work.
Created two follow-ups:
@Eevans sigh :( I found https://github.com/gocql/gocql/issues/1611, which may help; I didn't have time to check the code though.
Mon, Apr 15
In T362503#9712500, @Q-bit-array wrote: As the creator of the other ticket (T362506), I would add that the ORES/LiftWing infrastructure on the Russian Wikipedia was quite unstable during the whole of last week. There were numerous small outages, ranging from a few minutes to about an hour. But the most recent one was really long: 16-18 hours.
Is it perhaps possible to install some monitoring to prevent such issues in the future?
I think that multiple requests caused a ton of time to be spent in preprocess(), causing the isvc to stall completely and get into a weird state (most probably revscoring ended up in a broken/non-working state).
Something strange: in the istio gateway logs, the HTTP response code logged is 0 (in Envoy this usually means the connection was closed before any response was sent): https://logstash.wikimedia.org/goto/9003f0bd1a3c34e303ac5fbe86eff693
Fri, Apr 12
Current status:
Thu, Apr 11
aqs1010's instances are running with PKI TLS certs; so far everything looks good. I had a chat with Eric and Ben: we'll let it run until next week to catch any issues, and then we'll proceed with the rest of the cluster.
I'm very ignorant about the internals, but the procedure seems sound! In the ML case we could live without backfilling the previous quarters' SLO time series; the more pressing thing is to start from a clean state (without gaps, etc.). Thanks for the work!
Tue, Apr 9
In T361483#9688445, @akosiaris wrote: In T361483#9680093, @elukey wrote: In T361483#9680024, @akosiaris wrote: In T361483#9679703, @elukey wrote: Hello!
The ores_cache job should be defined but disabled in the running config; we don't use it anymore, and IIRC it is no longer running in ChangeProp (lemme know otherwise).
You are correct. I'll post a patch then to remove it. Thanks!
For Lift Wing, we just use CP to call inference.discovery.wmnet; no restbase is involved. The idea is to create streams like "for every rev-id, get a score from a Lift Wing model".
This is probably something we want to move out of Changeprop then and into the jobqueue (same software, I know, but a different installation). Looking at the config, I think there is no code specific to Lift Wing, just a standard reaction to events on Kafka.
No problem for me! I can only see one issue, and it is not specific to our topics: if we start another job in cp-jobqueue, the kafka consumer offset will be reset to whatever the last element in the topic is, and we'll potentially lose events in the stream. It is not a huge deal, since at the moment nothing incredibly critical relies on them, but IIRC Search uses one of the running topics to update Elasticsearch. If we move everything over we'd need to sync with them and figure out if a "hole" in the stream is acceptable; otherwise, the only approach I can think of is:
- stop the changeprop rule for the lift wing topic that Search uses.
- write down the offset of the related consumer group using the Kafka API (IIRC it should be possible)
- create another consumer group in cp-jobqueue with the same initial offset (this is not super difficult but I have never done it).
- add the rule to cp-jobqueue and check if it works.
Hmmm, can these endpoints receive the same request twice? I see that all changeprop does is a POST to https://inference.discovery.wmnet:30443/v1/models/<wiki>/<sometopic> with a body that contains event: '{{globals.message}}'
And the rules apparently make it pretty easy to run them from both changeprop and jobqueue simultaneously. That way we might just run both for a while (a couple of days?) and then shut down the changeprop part, leaving jobqueue to continue as normal.
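For reference, such a rule in the changeprop config looks roughly like this (rule name and topic are placeholders taken from the description above; the exact schema may differ slightly):

```yaml
liftwing_score:                          # hypothetical rule name
  topic: <sometopic>                     # the source Kafka topic
  exec:
    method: post
    uri: 'https://inference.discovery.wmnet:30443/v1/models/<wiki>/<sometopic>'
    headers:
      content-type: application/json
    body:
      event: '{{globals.message}}'
```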
Thanks to Aiko, who fixed some issues with RR Wikidata and ML, the new code is now deployed to all the model servers that used to have OMP_NUM_THREADS explicitly set in deployment-charts. The model servers work fine and their performance is good.
I opened T362181 for the Airflow/Spark clients; IIUC they don't currently use TLS, so we should be good.
Mon, Apr 8
@MatthewVernon we could do something like the following:
Fri, Apr 5
The last step is to review/merge the puppet changes listed above, then we can close!
Created new /116 for AUX: https://netbox.wikimedia.org/ipam/prefixes/930/
I tried to check the Cassandra AQS clients and how they trust/validate TLS certificates. IIUC all the clients are on k8s and use the cassandra-http-gateway chart, which renders a config file like /etc/cassandra-http-gateway/config.yaml containing various info about how to connect to a Cassandra cluster, and among those I found:
Added a change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013571/) to be able to keep the ca-manager's ca-bundle around, since currently it disappears from the puppet catalog (so the file stays on the node, but if we reimage, it will not be re-created, etc.).
The new option is meant only as a compromise to fully transition us to cfssl without too much pain :)
I have updated the docs for the renewal use case; I don't think we need to change anything in the cert's manifest for it.
Thu, Apr 4
Rolled out Dragonfly to all ml clusters!
All subtasks completed; wrapping up the task. Thanks to all for the feedback/help/support! <3
We have created two base images, one for PyTorch 2.2.x and one for 2.1.x; they will be tested and used with Revert Risk ML and Hugging Face's model server.
Hi Jeff!
Wed, Apr 3
The old (cergen-based) certificates for Kafka Logging in deployment-prep expired, and we worked to add a PKI-based TLS cert in place of the old one (self-managed via puppet, etc., with no need for any manual periodic renewal). All the puppet work was about making PKI work in deployment-prep, since there were several issues causing failures. The puppet master rebuild caused some trouble when trying to apply private commits to deployment-prep's private repo. While at it, deployment-prep's kafka jumbo cluster was moved to PKI as well, to fix an issue (also a TLS cert expiry) with varnishkafka running on the deployment-prep cache node.
I think all the work is done; varnishkafka seems to be working fine now and the logs are flowing nicely. Shall we close?
In T361328#9674202, @Ladsgroup wrote: It's great that it's not a big deal, it scared me for good. Wonder how much work it would be to change the password to avoid further scares in the future.
Tue, Apr 2
In T360595#9681147, @thcipriani wrote: @thcipriani I haven't forgotten about varnishkafka; if the above works we can apply the same fix to the other kafka nodes and we should be good.
That'd be great! Thanks so much @elukey !
Ok, so the issue was that the profile::pki::client::auth_key value set under hiera's profile settings was not picked up (thanks Taavi for the help), so I was trying the right config in the wrong place.
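To illustrate (not the actual change): the key itself is just a plain hiera lookup, so it has to live at a level that the node's hiera hierarchy actually consults:

```yaml
# Hypothetical node-level hiera data; the key name is real, the value
# is a placeholder. The same key set under the "profile settings"
# section was not picked up by the lookup.
profile::pki::client::auth_key: "REDACTED-TOKEN"
```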
Update: after a lot of config changes, we are now able to add local commits to the private repo on deployment-puppetserver-1. I tried various hacks to override the value we set in profile::pki::client::auth_key, but it doesn't seem to be picked up.
In T361483#9680024, @akosiaris wrote: In T361483#9679703, @elukey wrote: Hello!
The ores_cache job should be defined but disabled in the running config; we don't use it anymore, and IIRC it is no longer running in ChangeProp (lemme know otherwise).
You are correct. I'll post a patch then to remove it. Thanks!
For Lift Wing, we just use CP to call inference.discovery.wmnet; no restbase is involved. The idea is to create streams like "for every rev-id, get a score from a Lift Wing model".
This is probably something we want to move out of Changeprop then and into the jobqueue (same software, I know, but a different installation). Looking at the config, I think there is no code specific to Lift Wing, just a standard reaction to events on Kafka.
I have seen the same behavior, namely pip trying to download torch's CPU version and ending up installing only nvidia-related packages. I like the explicit-dependency solution; it is less flexible than letting pip manage dependencies, but I think it is the only viable way to get a good result.
I think that by default any puppetmaster that pulls data from another repository (i.e., that is not the canonical source of truth) has this protection to avoid mistakes. I also see some actually-private/local tags on some commits in the git history; probably we are missing some configs.
@Andrew Hi! I don't see any instance named like that in deployment-prep; IIRC we deleted it, as it is not used anymore.
The ores_cache job should be defined but disabled in the running config; we don't use it anymore, and IIRC it is no longer running in ChangeProp (lemme know otherwise).
The PKI intermediate cloud node is fixed; now I think we need to tackle the second biggest issue pointed out by Amir, namely that the auth token is probably misconfigured on a lot of cloud nodes.
Fri, Mar 29
Ok, the fact that the PKI node with the intermediate CA is broken is not great; let's try to fix that first.
The perms are the same for the wmf cacert bundle:
I did the following on logstash1036:
Hi folks! IIRC the permissions of the file (rw only by root) should be enough to prevent injection of new CA certificates, but I'll double-check and report back.
Thu, Mar 28
There is an obstacle with the current approach that I hadn't thought about. In the current setup, this happens:
@herron something really strange: https://w.wiki/9bMW
Commands executed, new status:
Wed, Mar 27
Use case to test:
To keep archives happy:
Everything done!
Tue, Mar 26
Do the PSS give the same early feedback even with Deployment objects?
During the SIG meeting we wondered what feedback a deployer would get from PSS vs VAP+CEL. We knew the behavior of the latter (namely, the Deployment/etc. resources are allowed to be created, but the corresponding Pod would not be created if a policy is breached) but not the former.
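To make the comparison concrete, the VAP+CEL side looks roughly like this (a sketch with a made-up policy requiring runAsNonRoot; a ValidatingAdmissionPolicyBinding is also needed, and the apiVersion depends on the cluster version):

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-runasnonroot           # hypothetical policy name
spec:
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
  validations:
    - expression: >-
        object.spec.containers.all(c,
        has(c.securityContext) &&
        has(c.securityContext.runAsNonRoot) &&
        c.securityContext.runAsNonRoot)
      message: "all containers must set runAsNonRoot: true"
```

With a policy like this scoped to Pods, creating the Deployment succeeds but its Pods are rejected at admission time, which matches the VAP+CEL behavior described above.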
High level plan for codfw:
Mon, Mar 25
With https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014035/ we changed how labels are collected on the Prometheus nodes. We now have a dedicated job, called k8s-pods-istio, that collects only istio metrics and applies some policies about which labels to keep. We dropped 14 labels from the original set, so hopefully the time series are now easier to manage. We'll keep the Thanos SLI recording rules monitored for a bit, to measure the difference (if any; hopefully yes) in performance.
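A sketch of what such a job looks like in Prometheus terms (the port name and the dropped labels are illustrative, not the actual list of 14):

```yaml
- job_name: k8s-pods-istio
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # scrape only the Envoy/istio metrics port on each pod
    - source_labels: [__meta_kubernetes_pod_container_port_name]
      regex: http-envoy-prom
      action: keep
  metric_relabel_configs:
    # drop high-cardinality labels we do not use in the SLI recording rules
    - regex: (source_canonical_revision|destination_canonical_revision)
      action: labeldrop
```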