
Today

elukey added a comment to T353622: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing.

After some tests I was able to find an istio configuration to support both transparent and non-transparent proxy settings (namely, setting WIKI_URL or similar to a discovery endpoint or just using http://en.wikipedia.org).

Fri, Apr 19, 3:15 PM · Patch-For-Review, Machine-Learning-Team
elukey committed rMLISea4bf862bd7f: revscoring_model: fix typo in error msg.
revscoring_model: fix typo in error msg
Fri, Apr 19, 1:14 PM
elukey added a comment to T352647: Move Cassandra clusters to PKI.

Next steps:

  • Roll out PKI to aqs eqiad (codfw already done and it is running fine)
  • Roll out the new Truststore to all Restbase nodes (prep-step to deploy PKI).
Fri, Apr 19, 1:10 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey updated subscribers of T362181: Encrypt Airflow connections to AQS Cassandra.

@JAllemandou Hi! I have a question for you when you have a moment :)

Fri, Apr 19, 1:09 PM · Data-Platform-SRE, Data-Engineering, Data-Persistence, Cassandra

Yesterday

elukey committed rMLISce4e044c3163: revscoring: add flag to log JSON inputs.
revscoring: add flag to log JSON inputs
Thu, Apr 18, 4:13 AM

Wed, Apr 17

elukey added a comment to T362503: ORES doesn't work (at least for ru- and ukwiki).

I've noticed some autoscaling, and high cpu usage in the kserve containers. I've raised the min/max replicas from 1/4 to 4/6, and with more capacity the latency is way better (at least for now).

Wed, Apr 17, 7:21 PM · Patch-For-Review, Machine-Learning-Team, ORES
elukey added a comment to T362503: ORES doesn't work (at least for ru- and ukwiki).

We are still seeing high latency for ruwiki's damaging model. The current theory is that some rev-ids cause trouble in the preprocessing phase (feature extraction etc.) and leave the service in an inconsistent state that affects other requests as well. We'll deploy https://gerrit.wikimedia.org/r/1020898 as a stop-gap to figure out which requests are causing this issue, and then we'll try to find a fix. For the moment users may experience trouble when calling Lift Wing. Apologies in advance, and please keep reporting connectivity issues if you find them.

Wed, Apr 17, 7:12 PM · Patch-For-Review, Machine-Learning-Team, ORES
elukey added a comment to T362503: ORES doesn't work (at least for ru- and ukwiki).

Saved ruwiki's pod logs to deploy1002:/home/elukey/T362503

Wed, Apr 17, 6:25 PM · Patch-For-Review, Machine-Learning-Team, ORES
elukey added a comment to T353622: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing.

Overall steps:

Wed, Apr 17, 3:52 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T353622: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing.

Today I found https://github.com/istio/istio/issues/21914 after a lot of debugging in staging for T362316. The main issue that I was trying to solve was that after applying the new ServiceEntry/DestinationRule/VirtualService config for T362316, everything worked like a charm, but it suddenly stopped working if https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984214 was reverted (manually deleting the extra ServiceEntry).

Wed, Apr 17, 3:47 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T362316: Migrate ml-services to mw-api-int.

Added some thoughts to T353622#9723070, I found a big can of worms while testing in staging :) The upgrade is more complex than anticipated, but we should be able to do it this week or next at the latest.

Wed, Apr 17, 3:47 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey added a comment to T362661: Create basic alerts for isvcs to catch outages.

There are two kinds of istio metrics: the ones from the gateway and the ones from the sidecars (inbound and outbound). In theory it should be sufficient to check the gateway metrics, since if a sidecar misbehaves it should be clearly visible in them, and it should reduce the volume of metrics pulled even further. The gateway metrics should be distinguishable from the rest via the kubernetes_namespace="istio-system" label.
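
As a rough illustration of the gateway-only idea above, here is a minimal sketch that builds a PromQL expression for the share of 5xx responses seen at the gateway. The kubernetes_namespace="istio-system" selector comes from the comment; istio_requests_total and response_code are Istio's standard metric and label names, and whether both are kept in this particular Prometheus setup is an assumption. The function name and the 5m window are hypothetical.

```python
# Sketch only: builds a query string, it does not talk to Prometheus.
# Assumes the standard Istio metric istio_requests_total and its
# response_code label are available for the gateway pods.

GATEWAY = 'kubernetes_namespace="istio-system"'


def gateway_5xx_ratio(window: str = "5m") -> str:
    """Return a PromQL expression for the gateway's 5xx share of requests."""
    errors = f'sum(rate(istio_requests_total{{{GATEWAY}, response_code=~"5.."}}[{window}]))'
    total = f'sum(rate(istio_requests_total{{{GATEWAY}}}[{window}]))'
    return f"{errors} / {total}"


query = gateway_5xx_ratio()
print(query)
```

An alert could then compare the value of such an expression against a threshold; the threshold and any per-isvc grouping are left out here on purpose.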

Wed, Apr 17, 11:36 AM · Patch-For-Review, Machine-Learning-Team, ORES

Tue, Apr 16

elukey added a comment to T352647: Move Cassandra clusters to PKI.

I propose we carry on with the migration to PKI, accepting that Cassandra-based golang services will have to have verification disabled for now. It's not a regression, so I don't think we should let it hold up this work.

Tue, Apr 16, 3:18 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey added a comment to T362503: ORES doesn't work (at least for ru- and ukwiki).

Created two follow ups:

Tue, Apr 16, 1:59 PM · Patch-For-Review, Machine-Learning-Team, ORES
elukey created T362663: Add slow-logs for ML isvcs.
Tue, Apr 16, 1:58 PM · Patch-For-Review, Machine-Learning-Team, ORES
elukey created T362661: Create basic alerts for isvcs to catch outages.
Tue, Apr 16, 1:51 PM · Patch-For-Review, Machine-Learning-Team, ORES
elukey added a comment to T352647: Move Cassandra clusters to PKI.

@Eevans sigh :( I found https://github.com/gocql/gocql/issues/1611 that may help, I didn't have time to check the code though.

Tue, Apr 16, 1:10 PM · Patch-For-Review, Data-Persistence, Cassandra

Mon, Apr 15

elukey triaged T362503: ORES doesn't work (at least for ru- and ukwiki) as High priority.
Mon, Apr 15, 4:24 PM · Patch-For-Review, Machine-Learning-Team, ORES
elukey added a comment to T362503: ORES doesn't work (at least for ru- and ukwiki).

As a creator of the other ticket (T362506) I would add that the ORES/LiftWing infrastructure in the Russian Wikipedia was quite unstable during the whole last week. There were numerous small outages, ranging from a few minutes to about one hour. But the recent one was really long: 16-18 hours.
Is it perhaps possible to install some monitoring to prevent such issues in the future?

Mon, Apr 15, 4:24 PM · Patch-For-Review, Machine-Learning-Team, ORES
elukey added a comment to T362503: ORES doesn't work (at least for ru- and ukwiki).

I think that multiple requests caused a ton of time spent in preprocess(), causing the isvc to totally stall and get into a weird state (most probably revscoring ended up in a weird/not-working state).

Mon, Apr 15, 4:22 PM · Patch-For-Review, Machine-Learning-Team, ORES
elukey updated the task description for T352647: Move Cassandra clusters to PKI.
Mon, Apr 15, 3:03 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey added a comment to T362503: ORES doesn't work (at least for ru- and ukwiki).

Something strange: from the istio gateway logs, the HTTP response code logged is 0 https://logstash.wikimedia.org/goto/9003f0bd1a3c34e303ac5fbe86eff693

Mon, Apr 15, 2:46 PM · Patch-For-Review, Machine-Learning-Team, ORES

Fri, Apr 12

elukey placed T362316: Migrate ml-services to mw-api-int up for grabs.
Fri, Apr 12, 3:59 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey added a comment to T362316: Migrate ml-services to mw-api-int.

Current status:

Fri, Apr 12, 3:58 PM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
elukey committed rLPRIb9deb4728a30: role::cassandra_dev: add fake truststore password for PKI.
role::cassandra_dev: add fake truststore password for PKI
Fri, Apr 12, 3:13 PM

Thu, Apr 11

elukey added a comment to T352647: Move Cassandra clusters to PKI.

aqs1010's instances are running with PKI TLS certs, so far everything looks good. I had a chat with Eric and Ben, we'll let it run until next week to catch issues and then we'll proceed with the rest of the cluster.

Thu, Apr 11, 4:18 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey added a comment to T349521: Prometheus/Pyrra: establish backfill process for recording rules.

Very ignorant about the internals but the procedure seems sound! In the ML case, we could live without backfilling the previous quarters/SLO time series, the more pressing thing is to start from a clean state (without gaps etc..). Thanks for the work!

Thu, Apr 11, 3:27 PM · Patch-For-Review, User-herron, Observability-Metrics

Tue, Apr 9

elukey added a comment to T361483: Selectively disable changeprop functionality that is no longer used.

Hello!

The ores_cache job should be defined but disabled in the running config, we don't use it anymore and IIRC it is not running anymore in ChangeProp (lemme know otherwise).

You are correct. I'll post a patch then to remove it. Thanks!

For Lift Wing, we just use CP to call inference.discovery.wmnet, no restbase involved. The idea is to create streams like "for every rev-id, get a score from a Lift Wing model".

This is probably something we want to move away from Changeprop then and into the jobqueue (same software, I know, but a different installation). Looking at the config, I think that there is no code that is specific to LiftWing, just standard reaction to events on kafka.

No problem for me! I can only see one issue, and this is something not specific to our topics: if we start another job in cp-jobqueue, the kafka consumer offset will be reset to whatever is the last element in the topic, and we'll potentially lose events in the stream. It is not a huge deal since at the moment nothing incredibly critical relies on them, but IIRC Search uses one of the running topics to update Elastic Search. If we move everything over we'd need to sync with them and figure out if a "hole" in the stream is acceptable, otherwise the only thing that I can think of is:

  • stop the changeprop rule for the lift wing topic that Search uses.
  • write down the offset of the related consumer group using the kafka api (IIRC it should be possible)
  • create another consumer group in cp-jobqueue with the same initial offset (this is not super difficult but I have never done it).
  • add the rule to cp-jobqueue and check if it works.
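
The offset concern behind the steps above can be sketched with a toy model. This is plain Python with lists and integers, not a Kafka client: the event names, offsets and variable names are made up for illustration, and the real procedure would read and write consumer group offsets through the Kafka tooling.

```python
# Toy model only: a topic is a list, an offset is an index into it.
# Nothing here talks to Kafka; all names and values are illustrative.

def consume(topic, start_offset):
    """Return the events a consumer group reads from start_offset onward."""
    return topic[start_offset:]


topic = [f"rev-{i}" for i in range(5)]  # events at offsets 0..4
old_group_offset = 5                     # the changeprop group has read all five

# Steps 1 and 2: stop the changeprop rule and write down its offset.
saved_offset = old_group_offset

# Events keep arriving while no rule is consuming.
topic += ["rev-5", "rev-6"]

# A brand-new group without a stored offset starts at the end of the topic.
latest_start = len(topic)
# Step 3: a new group seeded with the saved offset starts where the old one stopped.
seeded_start = saved_offset

# Step 4: the cp-jobqueue rule is enabled and one more event arrives.
topic += ["rev-7"]

seen_with_latest = consume(topic, latest_start)
seen_with_seed = consume(topic, seeded_start)
print(seen_with_latest)  # ['rev-7']
print(seen_with_seed)    # ['rev-5', 'rev-6', 'rev-7']
```

In this model the group that starts from the latest offset never sees rev-5 and rev-6, which is the "hole" in the stream mentioned above, while the seeded group picks up exactly where the old one stopped.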

Hmmm, can these endpoints receive the same request 2 times? I see that all that changeprop does is a POST to https://inference.discovery.wmnet:30443/v1/models/<wiki>/<sometopic> with a body that contains event: '{{globals.message}}'

And the rules make it apparently pretty easy to have them both run from changeprop and jobqueue simultaneously. That way we might just run both for a while (a couple of days?) and then just shutdown the changeprop parts of it, leaving jobqueue to continue as normal.

Tue, Apr 9, 3:55 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTbase Deprecation Roadmap)
elukey moved T360111: Set automatically libomp's num threads when using Pytorch from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Tue, Apr 9, 3:53 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T360111: Set automatically libomp's num threads when using Pytorch.

Thanks to Aiko, who fixed some issues with RR Wikidata and ML, the new code is now deployed to all the model servers that used to have OMP_NUM_THREADS explicitly stated in deployment-charts. The model servers work fine and their performance is good.

Tue, Apr 9, 3:53 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T352647: Move Cassandra clusters to PKI.

I opened T362181 for Airflow/Spark clients, IIUC they don't currently use TLS so we should be good.

Tue, Apr 9, 3:38 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey created T362181: Encrypt Airflow connections to AQS Cassandra.
Tue, Apr 9, 3:36 PM · Data-Platform-SRE, Data-Engineering, Data-Persistence, Cassandra

Mon, Apr 8

elukey added a comment to T361844: Swift TLS certificates will expire soon (14 April).

@MatthewVernon we could do something like the following:

Mon, Apr 8, 4:33 PM · Patch-For-Review, SRE-swift-storage
jcrespo awarded T352647: Move Cassandra clusters to PKI a Love token.
Mon, Apr 8, 2:07 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey closed T353705: Fix IPv6 service IP ranges for all Kubernetes clusters as Resolved.
Mon, Apr 8, 12:46 PM · Data-Platform-SRE
elukey moved T353622: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing from Backlog/Lift Wing to In Progress on the Machine-Learning-Team board.
Mon, Apr 8, 12:33 PM · Patch-For-Review, Machine-Learning-Team
elukey claimed T353622: Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing.
Mon, Apr 8, 12:33 PM · Patch-For-Review, Machine-Learning-Team
elukey placed T360120: Run unit tests for the inference-services repo in CI up for grabs.
Mon, Apr 8, 12:32 PM · Machine-Learning-Team
elukey moved T352756: Gap in metrics rendered from Thanos Rules from Backlog/SRE to Blocked on the Machine-Learning-Team board.
Mon, Apr 8, 12:31 PM · SRE Observability (FY2023/2024-Q4), Observability-Metrics, Machine-Learning-Team
elukey claimed T352756: Gap in metrics rendered from Thanos Rules.
Mon, Apr 8, 12:31 PM · SRE Observability (FY2023/2024-Q4), Observability-Metrics, Machine-Learning-Team
elukey moved T351390: Istio recording rules for Pyrra and Grizzly from Backlog/SRE to Blocked on the Machine-Learning-Team board.
Mon, Apr 8, 12:31 PM · Patch-For-Review, Machine-Learning-Team, observability
elukey updated the task description for T361964: Golang-based Cassandra clients do not perform TLS host verification.
Mon, Apr 8, 9:27 AM · AQS2.0, Data Products, Cassandra

Fri, Apr 5

elukey committed rMLISf8ee123de454: python: upgrade aiohttp's version to avoid issues with py3.11.
python: upgrade aiohttp's version to avoid issues with py3.11
Fri, Apr 5, 3:50 PM
elukey claimed T353705: Fix IPv6 service IP ranges for all Kubernetes clusters.
Fri, Apr 5, 3:40 PM · Data-Platform-SRE
elukey added a comment to T353705: Fix IPv6 service IP ranges for all Kubernetes clusters.

Last step is to review/merge the puppet changes listed above, then we can close!

Fri, Apr 5, 3:39 PM · Data-Platform-SRE
elukey updated the task description for T353705: Fix IPv6 service IP ranges for all Kubernetes clusters.
Fri, Apr 5, 3:36 PM · Data-Platform-SRE
elukey added a comment to T353705: Fix IPv6 service IP ranges for all Kubernetes clusters.

Created new /116 for AUX: https://netbox.wikimedia.org/ipam/prefixes/930/

Fri, Apr 5, 3:24 PM · Data-Platform-SRE
elukey closed T360638: Create a Pytorch base image as Resolved.
Fri, Apr 5, 2:14 PM · Patch-For-Review, Machine-Learning-Team
elukey closed T360638: Create a Pytorch base image, a subtask of T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images, as Resolved.
Fri, Apr 5, 2:13 PM · Machine-Learning-Team
elukey closed T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images as Resolved.
Fri, Apr 5, 2:13 PM · Machine-Learning-Team
elukey closed T359416: Add Dragonfly to the ML k8s clusters as Resolved.
Fri, Apr 5, 2:13 PM · Machine-Learning-Team
elukey closed T359416: Add Dragonfly to the ML k8s clusters, a subtask of T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images, as Resolved.
Fri, Apr 5, 2:13 PM · Machine-Learning-Team
elukey moved T359879: SLO dashboards for Lift Wing showing unexpected values from In Progress to Blocked on the Machine-Learning-Team board.
Fri, Apr 5, 2:13 PM · Patch-For-Review, Machine-Learning-Team, Observability-Metrics
elukey added a comment to T352647: Move Cassandra clusters to PKI.

I tried to check the Cassandra AQS' clients and how they trust/validate TLS certificates. IIUC all the clients are on k8s and using the cassandra-http-gateway as chart, that renders a config file like /etc/cassandra-http-gateway/config.yaml containing various info about how to connect to a Cassandra cluster, and among those I found:

Fri, Apr 5, 2:05 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey added a comment to T352647: Move Cassandra clusters to PKI.

Added a change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013571/) to be able to keep the ca-manager's ca-bundle around, since currently it disappears from the puppet catalog (so the file stays on the node, but if we reimage it will not be re-created etc.).
The new option is meant only as a compromise to fully transition us to cfssl without too much pain :)

Fri, Apr 5, 1:08 PM · Patch-For-Review, Data-Persistence, Cassandra
elukey added a comment to T361844: Swift TLS certificates will expire soon (14 April).

I have updated the docs for the renewal use case, I don't think that we need to change anything in the cert's manifest for this use case (renewal).

Fri, Apr 5, 8:06 AM · Patch-For-Review, SRE-swift-storage

Thu, Apr 4

elukey moved T360637: Bump memory for registry[12]00[34] VMs from Ready To Go to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Thu, Apr 4, 3:57 PM · Patch-For-Review, serviceops, Machine-Learning-Team
elukey moved T359416: Add Dragonfly to the ML k8s clusters from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Thu, Apr 4, 3:57 PM · Machine-Learning-Team
elukey added a comment to T359416: Add Dragonfly to the ML k8s clusters.

Rolled out Dragonfly to all ml clusters!

Thu, Apr 4, 3:56 PM · Machine-Learning-Team
elukey moved T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images from Blocked to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Thu, Apr 4, 3:56 PM · Machine-Learning-Team
elukey added a comment to T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images.

All subtasks completed, wrapping up the task, thanks to all for feedback/help/support! <3

Thu, Apr 4, 3:56 PM · Machine-Learning-Team
elukey moved T360638: Create a Pytorch base image from Blocked to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Thu, Apr 4, 3:55 PM · Patch-For-Review, Machine-Learning-Team
elukey moved T360638: Create a Pytorch base image from Ready To Go to Blocked on the Machine-Learning-Team board.
Thu, Apr 4, 3:55 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T360638: Create a Pytorch base image.

We have created two base images, one for Pytorch 2.2.x and one for 2.1.x, they will be tested and used with Revert Risk ML and Hugging face's model server.

Thu, Apr 4, 3:54 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T360779: Phase out cergen for Fundraising services.

Hi Jeff!

Thu, Apr 4, 12:54 PM · SRE

Wed, Apr 3

elukey added a comment to T360595: beta-scap-sync-world fails: logstash_checker.py: KeyError: 'aggregations'.

The old certificates (cergen based) for Kafka Logging in deployment-prep expired and we worked to add a PKI-based TLS cert instead of the old one (self-managed via puppet etc., no need for any manual and periodic renewal). All the puppet work was related to making PKI work in deployment-prep, since there were several issues that caused failures. The puppet master rebuild caused some trouble when trying to apply private commits in deployment-prep's private repo. While at it, the deployment-prep's kafka jumbo cluster was moved to PKI as well to fix an issue with varnishkafka running on the deployment-prep's cache node (TLS cert expiry as well).

Wed, Apr 3, 7:39 PM · Infrastructure-Foundations, Patch-For-Review, SRE Observability, observability, Beta-Cluster-Infrastructure, Release-Engineering-Team
elukey added a comment to T360595: beta-scap-sync-world fails: logstash_checker.py: KeyError: 'aggregations'.

I think that all the work is done, varnishkafka seems to be working fine now and the logs are flowing nicely. Shall we close?

Wed, Apr 3, 1:17 PM · Infrastructure-Foundations, Patch-For-Review, SRE Observability, observability, Beta-Cluster-Infrastructure, Release-Engineering-Team
elukey committed rLPRId6046f2c0884: profile::pki::client: re-introduce fake auth token.
profile::pki::client: re-introduce fake auth token
Wed, Apr 3, 1:04 PM
elukey committed rLPRI3cea952f1cfb: Remove profile::pki::client's specific hiera config.
Remove profile::pki::client's specific hiera config
Wed, Apr 3, 12:52 PM
elukey added a comment to T361328: Password to keystore of java certificates needs changing.

It's great that it's not a big deal, it scared me for good. Wonder how much work it is to change the password to avoid further scares in the future.

Wed, Apr 3, 12:51 PM · SecTeam-Processed, Security

Tue, Apr 2

elukey added a comment to T360595: beta-scap-sync-world fails: logstash_checker.py: KeyError: 'aggregations'.

@thcipriani I haven't forgotten about varnishkafka, if the above works we can apply the same fix to the other kafka nodes and we should be good.

That'd be great! Thanks so much @elukey !

Tue, Apr 2, 4:28 PM · Infrastructure-Foundations, Patch-For-Review, SRE Observability, observability, Beta-Cluster-Infrastructure, Release-Engineering-Team
elukey added a comment to T360595: beta-scap-sync-world fails: logstash_checker.py: KeyError: 'aggregations'.

Ok so the issue was that the profile::pki::client::auth_key value set under hiera's profile settings was not picked up (thanks Taavi for the help) so I was trying the right config in the wrong place.

Tue, Apr 2, 3:45 PM · Infrastructure-Foundations, Patch-For-Review, SRE Observability, observability, Beta-Cluster-Infrastructure, Release-Engineering-Team
elukey added a reverting change for rLPRI07717b16bda0: Remove profile::pki::client::auth_key from common.yaml: rLPRI52ef80f791e1: Revert "Remove profile::pki::client::auth_key from common.yaml".
Tue, Apr 2, 3:17 PM
elukey committed rLPRI52ef80f791e1: Revert "Remove profile::pki::client::auth_key from common.yaml".
Revert "Remove profile::pki::client::auth_key from common.yaml"
Tue, Apr 2, 3:17 PM
elukey committed rLPRI07717b16bda0: Remove profile::pki::client::auth_key from common.yaml.
Remove profile::pki::client::auth_key from common.yaml
Tue, Apr 2, 3:10 PM
elukey added a comment to T360595: beta-scap-sync-world fails: logstash_checker.py: KeyError: 'aggregations'.

Update: after a lot of config changes, we are now able to add local commits to the private repo on deployment-puppetserver-1. I tried various hacks to override the value that we set in profile::pki::client::auth_key, but it does not seem to be picked up.

Tue, Apr 2, 3:05 PM · Infrastructure-Foundations, Patch-For-Review, SRE Observability, observability, Beta-Cluster-Infrastructure, Release-Engineering-Team
elukey added a comment to T361483: Selectively disable changeprop functionality that is no longer used.

Hello!

The ores_cache job should be defined but disabled in the running config, we don't use it anymore and IIRC it is not running anymore in ChangeProp (lemme know otherwise).

You are correct. I'll post a patch then to remove it. Thanks!

For Lift Wing, we just use CP to call inference.discovery.wmnet, no restbase involved. The idea is to create streams like "for every rev-id, get a score from a Lift Wing model".

This is probably something we want to move away from Changeprop then and into the jobqueue (same software, I know, but a different installation). Looking at the config, I think that there is no code that is specific to LiftWing, just standard reaction to events on kafka.

Tue, Apr 2, 1:22 PM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTbase Deprecation Roadmap)
elukey added a comment to T357986: Use Huggingface model server image for HF LLMs.

I have seen the same behavior, namely pip trying to download the torch's cpu version and ending up only installing nvidia-related packages. I like the explicit-dependency solution, it is less flexible than letting pip manage dependencies but I think it is the only viable way to get a good result.

Tue, Apr 2, 1:01 PM · Patch-For-Review, Machine-Learning-Team
elukey updated subscribers of T360595: beta-scap-sync-world fails: logstash_checker.py: KeyError: 'aggregations'.

I think that by default any puppetmaster that pulls data from another repository (so it is not the canonical source of truth) has this protection to avoid mistakes. I see as well some actually-private/local tags for some commits in the git history, probably we are missing some configs.

Tue, Apr 2, 1:00 PM · Infrastructure-Foundations, Patch-For-Review, SRE Observability, observability, Beta-Cluster-Infrastructure, Release-Engineering-Team
elukey added a comment to T361385: Replace deployment-ores02.

@Andrew Hi! I don't see any instance named like that in deployment-prep, IIRC we deleted it, it is not used anymore.

Tue, Apr 2, 12:38 PM · cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure
elukey added a comment to T361483: Selectively disable changeprop functionality that is no longer used.

The ores_cache job should be defined but disabled in the running config, we don't use it anymore and IIRC it is not running anymore in ChangeProp (lemme know otherwise).

Tue, Apr 2, 11:35 AM · Patch-For-Review, Machine-Learning-Team, Lift-Wing, ORES, RESTBase Sunsetting, Content-Transform-Team, serviceops, ChangeProp, API Platform (RESTbase Deprecation Roadmap)
elukey added a comment to T360595: beta-scap-sync-world fails: logstash_checker.py: KeyError: 'aggregations'.

PKI intermediate cloud node fixed, now I think that we'd need to fix the second biggest issue pointed out by Amir, namely the fact that the auth token is probably misconfigured on a lot of cloud nodes.

Tue, Apr 2, 9:08 AM · Infrastructure-Foundations, Patch-For-Review, SRE Observability, observability, Beta-Cluster-Infrastructure, Release-Engineering-Team

Fri, Mar 29

elukey reassigned T361384: Replace deployment-memc[08-10] with Bullseye or Bookworm from elukey to jijiki.
Fri, Mar 29, 5:10 PM · serviceops, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure, cloud-services-team
elukey added a comment to T360595: beta-scap-sync-world fails: logstash_checker.py: KeyError: 'aggregations'.

Ok the fact that the PKI node with the intermediate CA is broken is not great, let's try to fix that first.

Fri, Mar 29, 3:58 PM · Infrastructure-Foundations, Patch-For-Review, SRE Observability, observability, Beta-Cluster-Infrastructure, Release-Engineering-Team
elukey added a comment to T361328: Password to keystore of java certificates needs changing.

The perms are the same for the wmf cacert bundle:

Fri, Mar 29, 10:43 AM · SecTeam-Processed, Security
elukey added a comment to T361328: Password to keystore of java certificates needs changing.

I did the following on logstash1036:

Fri, Mar 29, 10:39 AM · SecTeam-Processed, Security
elukey added a comment to T361328: Password to keystore of java certificates needs changing.

Hi folks! IIRC the permission of the file (rw only by root) should be enough to avoid injection of new CA certificates, but I'll double check and report back.

Fri, Mar 29, 10:26 AM · SecTeam-Processed, Security

Thu, Mar 28

elukey added a comment to T360638: Create a Pytorch base image.

There is an obstacle with the current approach that I didn't think about. In the current setup, this happens:

Thu, Mar 28, 5:43 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T359879: SLO dashboards for Lift Wing showing unexpected values.

@herron something really strange: https://w.wiki/9bMW

Thu, Mar 28, 3:26 PM · Patch-For-Review, Machine-Learning-Team, Observability-Metrics
BTullis awarded T361225: Update GPU labels in Hadoop 's Yarn config a Yellow Medal token.
Thu, Mar 28, 3:02 PM · Data-Platform-SRE
elukey closed T361225: Update GPU labels in Hadoop 's Yarn config as Resolved.

Commands executed, new status:

Thu, Mar 28, 3:00 PM · Data-Platform-SRE
elukey created T361225: Update GPU labels in Hadoop 's Yarn config.
Thu, Mar 28, 1:40 PM · Data-Platform-SRE

Wed, Mar 27

elukey added a comment to T360638: Create a Pytorch base image.

Use case to test:

Wed, Mar 27, 3:08 PM · Patch-For-Review, Machine-Learning-Team
elukey added a comment to T360638: Create a Pytorch base image.

To keep archives happy:

Wed, Mar 27, 3:05 PM · Patch-For-Review, Machine-Learning-Team
elukey closed T360637: Bump memory for registry[12]00[34] VMs as Resolved.

Everything done!

Wed, Mar 27, 12:04 PM · Patch-For-Review, serviceops, Machine-Learning-Team
elukey closed T360637: Bump memory for registry[12]00[34] VMs, a subtask of T359067: Find an efficient strategy to add Pytorch and ROCm packages to our Docker images, as Resolved.
Wed, Mar 27, 12:04 PM · Machine-Learning-Team

Tue, Mar 26

elukey added a comment to T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

Do the PSS give the same early feedback even with Deployment objects?

Tue, Mar 26, 6:11 PM · Patch-For-Review, serviceops, Prod-Kubernetes
elukey added a comment to T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.

During the SIG meeting we wondered what is the feedback that a deployer would get from PSS vs VAP+CEL, we knew the latter (namely the Deployment/Pod/etc.. resources are allowed to be created but the corresponding resource would not be created if a policy is breached) but not the former.

Tue, Mar 26, 5:59 PM · Patch-For-Review, serviceops, Prod-Kubernetes
elukey added a comment to T360637: Bump memory for registry[12]00[34] VMs.

High level plan for codfw:

Tue, Mar 26, 3:36 PM · Patch-For-Review, serviceops, Machine-Learning-Team

Mon, Mar 25

elukey added a comment to T351390: Istio recording rules for Pyrra and Grizzly.

With https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014035/ we changed how labels are collected on the Prometheus nodes. We now have a specific job called k8s-pods-istio that collects only istio metrics, and that applies some policies to what labels need to be kept. We dropped 14 labels from the original set, so hopefully the time series are now easier to manage. We'll keep the Thanos SLI recording rules monitored for a bit, to measure the difference (if any, hopefully yes) in performance.

Mon, Mar 25, 4:42 PM · Patch-For-Review, Machine-Learning-Team, observability