elukey (Luca Toscano)
Site Reliability Engineer - Analytics/Data engineering

User Details

User Since
Jan 5 2016, 9:54 PM (333 w, 4 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
LToscano (WMF) [ Global Accounts ]

Recent Activity

Fri, May 27

elukey added a comment to T302851: revscoring feature extraction error for wikitext papes in Wikidata .

Aiko deployed the change to deployment-prep, it looks very good:

Fri, May 27, 2:08 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks), ORES
elukey closed T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing as Resolved.

All pods running revscoring 2.11.4. Tested various endpoints and I can see scores correctly, no errors in the logs.

Fri, May 27, 2:03 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey committed rMLIS391f80b8fc06: draftquality: null change to trigger image publishing (authored by elukey).
draftquality: null change to trigger image publishing
Fri, May 27, 1:39 PM
elukey closed T307927: Unable to run helmfile and check pods as Resolved.

Kevin was able to deploy successfully without issues, so I think that we can close for the moment!

Fri, May 27, 9:28 AM · Machine-Learning-Team (Active Tasks), Lift-Wing
elukey added a comment to T307927: Unable to run helmfile and check pods.

Kevin and Aiko's users are now in the deployment POSIX group, so they should be able to deploy. Let's try to do it before closing the task :)

Fri, May 27, 6:21 AM · Machine-Learning-Team (Active Tasks), Lift-Wing
elukey added a comment to T309371: Gerrit: all patches are being reported as merge conflicts.

On contint1001 I see the following in /var/log/zuul/merger-debug.log:

Fri, May 27, 6:15 AM · Release-Engineering-Team, User-DannyS712, Continuous-Integration-Infrastructure
elukey committed rMLIS2c14be4bc146: articlequality: update dependencies to use revscoring 2.11.4 (authored by elukey).
articlequality: update dependencies to use revscoring 2.11.4
Fri, May 27, 4:36 AM
elukey committed rMLIS0d46cb0ef006: draftquality: update dependencies to use revscoring 2.11.4 (authored by elukey).
draftquality: update dependencies to use revscoring 2.11.4
Fri, May 27, 4:36 AM

Thu, May 26

elukey committed rODQ2b7cb9d20a3d: Increments version to 0.0.3 (authored by elukey).
Increments version to 0.0.3
Thu, May 26, 5:15 PM
elukey closed T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi as Resolved.

https://pypi.org/project/articlequality/0.4.3/
https://pypi.org/project/draftquality/0.0.3/

Thu, May 26, 2:23 PM · Machine-Learning-Team
elukey closed T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi, a subtask of T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing, as Resolved.
Thu, May 26, 2:23 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey added a comment to T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi.

Quick note: these repositories are the same ones used in the ores-deploy gerrit repo. Usually we mirror code from github to gerrit, and we deploy it to ORES. Since what we are doing is a changelog/version bump, we can skip (in my opinion) the deployment to ORES. We are not touching any model binary etc., so it is safe to skip an update.

Thu, May 26, 1:21 PM · Machine-Learning-Team
elukey added a comment to T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi.

Created https://github.com/wikimedia/draftquality/pull/43 for draftquality!

Thu, May 26, 1:09 PM · Machine-Learning-Team
elukey updated the task description for T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi.
Thu, May 26, 12:53 PM · Machine-Learning-Team
elukey renamed T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi from Release articlequality 0.4.3 to Pypi to Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi.
Thu, May 26, 12:53 PM · Machine-Learning-Team

Wed, May 25

elukey moved T302232: Set up the ml-cache clusters from In Progress to Blocked on the Machine-Learning-Team (Active Tasks) board.
Wed, May 25, 3:00 PM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey claimed T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing.
Wed, May 25, 2:59 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey moved T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing from Backlog to In Progress on the Machine-Learning-Team (Active Tasks) board.
Wed, May 25, 2:59 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey added a comment to T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing.

For articlequality, we need to first solve T309205 to release the new version of articlequality in Pypi.

Wed, May 25, 2:21 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey removed a subtask for T302851: revscoring feature extraction error for wikitext papes in Wikidata : T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi.
Wed, May 25, 2:20 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks), ORES
elukey added a subtask for T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing: T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi.
Wed, May 25, 2:20 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey edited parent tasks for T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi, added: T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing; removed: T302851: revscoring feature extraction error for wikitext papes in Wikidata .
Wed, May 25, 2:20 PM · Machine-Learning-Team
elukey added a parent task for T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi: T302851: revscoring feature extraction error for wikitext papes in Wikidata .
Wed, May 25, 2:19 PM · Machine-Learning-Team
elukey added a subtask for T302851: revscoring feature extraction error for wikitext papes in Wikidata : T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi.
Wed, May 25, 2:19 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks), ORES
elukey created T309205: Release articlequality 0.4.3 and draftquality 0.0.3 to Pypi.
Wed, May 25, 2:18 PM · Machine-Learning-Team
elukey added a comment to T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing.

The enwiki-goodfaith pod is now running revscoring 2.11.4 and I can successfully get scores without any weird error logged. I'll follow up with a code change to apply it to the other pods.

Wed, May 25, 1:48 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey committed rMLIS87da6b595907: editquality: upgrade revscoring to 2.11.4 (authored by elukey).
editquality: upgrade revscoring to 2.11.4
Wed, May 25, 10:16 AM
elukey added a comment to T296173: Load test the Lift Wing cluster.

@achou I am curious about the running processes inside the pod after we use kserve - what do you see if you run ps -aux | grep python and ps -eLf | grep python? My understanding is that every Ray worker should be a python process, in this case it would be very interesting. We currently have some restrictions for memory/cpu of every pod in production, so we probably have to tune settings for this use case. For example, with 2 ray workers, I'd expect to see:

Wed, May 25, 7:38 AM · Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey added a comment to T302232: Set up the ml-cache clusters.

To keep archives happy - I am having a chat with Eric over email about this cluster and its future usage. The AQS cassandra cluster should become a multi-tenant/dc cluster able to support various use cases, so we need to decide if ml-cache is a valid use case for a standalone cluster or not.

Wed, May 25, 7:11 AM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey created T309162: Remove old scap repositories from deploy1002.
Wed, May 25, 6:50 AM · SRE, SRE-OnFire, Sustainability, Release-Engineering-Team
elukey moved T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing from In Progress to Backlog on the Machine-Learning-Team (Active Tasks) board.
Wed, May 25, 6:17 AM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey placed T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing up for grabs.
Wed, May 25, 6:17 AM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey assigned T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing to achou.
Wed, May 25, 6:17 AM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey moved T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing from Backlog to In Progress on the Machine-Learning-Team (Active Tasks) board.
Wed, May 25, 6:16 AM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey added a comment to T302851: revscoring feature extraction error for wikitext papes in Wikidata .

Yep sorry forgot a few details, nice :) Before finishing let's expand https://wikitech.wikimedia.org/wiki/ORES/Deployment#Deploy_to_the_test_server with the steps to follow!

Wed, May 25, 6:16 AM · Patch-For-Review, Machine-Learning-Team (Active Tasks), ORES

Tue, May 24

elukey moved T302232: Set up the ml-cache clusters from Backlog to In Progress on the Machine-Learning-Team (Active Tasks) board.
Tue, May 24, 4:50 PM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey moved T307927: Unable to run helmfile and check pods from Backlog to Blocked on the Machine-Learning-Team (Active Tasks) board.
Tue, May 24, 4:50 PM · Machine-Learning-Team (Active Tasks), Lift-Wing
elukey closed T281495: Restructure ORES labs redis puppet role as Resolved.

This has been solved with https://gerrit.wikimedia.org/r/c/operations/puppet/+/785111 in theory, closing the task.

Tue, May 24, 4:50 PM · Infrastructure-Foundations, Puppet, Machine-Learning-Team, ORES
elukey added a comment to T295661: Upgrade ROCm to 4.5.

Time flies and both ROCm and tensorflow-io got several releases.

Tue, May 24, 4:47 PM · Analytics-Radar, Patch-For-Review, Machine-Learning-Team
elukey moved T307927: Unable to run helmfile and check pods from Unorganized to Active Tasks on the Machine-Learning-Team board.
Tue, May 24, 4:43 PM · Machine-Learning-Team (Active Tasks), Lift-Wing
elukey moved T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing from Unorganized to Active Tasks on the Machine-Learning-Team board.
Tue, May 24, 4:42 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey added a comment to T302851: revscoring feature extraction error for wikitext papes in Wikidata .

Next steps:

Tue, May 24, 4:38 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks), ORES
achou awarded T304063: Revscoring library branching proposal a Cup of Joe token.
Tue, May 24, 1:38 PM · Machine-Learning-Team
elukey created T309102: Bump revscoring to 2.11.4 on our Docker images for Lift Wing.
Tue, May 24, 1:18 PM · Patch-For-Review, Machine-Learning-Team (Active Tasks)
elukey closed T304063: Revscoring library branching proposal as Declined.

Coming back to this after T303801. We migrated ORES to Debian Buster and Python 3.7, updating wheels and dependencies. The revscoring library was fully compatible with the new setup, and it is now working like a charm.

Tue, May 24, 1:06 PM · Machine-Learning-Team
elukey added a comment to T300130: Move Kafka logging to the new intermediate PKI.

@colewhite hi! There is no rush at the moment of course, but I am wondering which remaining clients need to be migrated before we can switch the brokers' TLS certs to PKI.

Tue, May 24, 12:53 PM · Patch-For-Review, observability, SRE
elukey closed T296982: Move kafka clusters to fixed uid/gid as Resolved.

Change is rolled out everywhere, and now we have sane defaults in profile::kafka::broker.

Tue, May 24, 12:52 PM · Patch-For-Review, Data-Engineering, serviceops
elukey closed T296982: Move kafka clusters to fixed uid/gid, a subtask of T296641: Upgrade kafka-main nodes to buster, as Resolved.
Tue, May 24, 12:52 PM · Patch-For-Review, serviceops
elukey added a comment to T296173: Load test the Lift Wing cluster.

I read the kserve docs: https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/#parallel-inference

There are two ways to run parallel inference:

  • tune the workers parameter for the Tornado's httpserver
  • use RayServe to deploy ray workers

The first option does not work with our current kserve version (0.7.0) because of a bug, which is fixed in kserve 0.8.0.

Tue, May 24, 12:04 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)

Mon, May 23

elukey added a comment to T296173: Load test the Lift Wing cluster.

@elukey It seems we don't use async right now. We can try to use coroutines to preprocess and see if it would improve performance. Also writing test jobs for revscoring models (like the test_server.py you pasted) would be something good to do.

Mon, May 23, 2:44 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey created T309006: deployment-kafka-jumbo-5 in deployment-prep without role.
Mon, May 23, 9:49 AM · Data-Engineering, Beta-Cluster-Infrastructure
elukey added a comment to T296982: Move kafka clusters to fixed uid/gid.

The three kafka clusters in deployment-prep are using the new uid/gid. Before setting the profile::kafka::broker::use_fixed_uid_gid option to true by default, I'll follow up with SRE to verify that no other cluster is left to move.

Mon, May 23, 8:52 AM · Patch-For-Review, Data-Engineering, serviceops
elukey committed rLPRI882164f311c0: Rename ml::cache hiera config after role rename (authored by elukey).
Rename ml::cache hiera config after role rename
Mon, May 23, 7:45 AM

Fri, May 20

elukey added a comment to T296173: Load test the Lift Wing cluster.

Keeping a note about https://github.com/kserve/kserve/blob/release-0.7/python/kserve/kserve/kfmodel.py#L52

Fri, May 20, 9:18 AM · Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey added a comment to T296173: Load test the Lift Wing cluster.

Things to do (in my opinion):

Fri, May 20, 9:02 AM · Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey claimed T302232: Set up the ml-cache clusters.
Fri, May 20, 8:53 AM · Patch-For-Review, Epic, Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey committed rLPRI3f5e1a7292ff: Add fake secret for the new ML Cassandra cluster (authored by elukey).
Add fake secret for the new ML Cassandra cluster
Fri, May 20, 8:49 AM
elukey moved T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira from In Discussion to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Fri, May 20, 6:49 AM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
elukey moved T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira from Awaiting User Input to In Discussion on the SRE-Access-Requests board.
Fri, May 20, 6:48 AM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
elukey updated subscribers of T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira.

@thcipriani Hi! When you have a moment, could you please review this request and let me know if it is a good use case for deployment ? Thanks :)

Fri, May 20, 6:48 AM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests

Thu, May 19

elukey added a comment to T296173: Load test the Lift Wing cluster.

I was able to use wrk, very interesting tool installed on deploy1002. We can use lua scripts like the following:

Thu, May 19, 2:01 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey added a comment to T296173: Load test the Lift Wing cluster.

It is very weird, siege supports HTTP/1.1 but I see the following:

Thu, May 19, 9:48 AM · Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey added a comment to T296173: Load test the Lift Wing cluster.

Interesting discovery - it seems that my previous tests with ab and siege used HTTP/1.0, not 1.1, and the responses from istio were all 426 Upgrade Required (so not really representative of HTTP traffic handled by a single pod). The ab tool does not seem ready for HTTP/1.1 yet; siege should support it, but the version I am using on deploy1002 could be outdated.

Thu, May 19, 8:45 AM · Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey added a comment to T302851: revscoring feature extraction error for wikitext papes in Wikidata .

Published https://pypi.org/project/revscoring/2.11.4/ from a Python 3.7 environment (just to be extra sure). The size of the wheel seems to be the same as 2.11.2, so probably some compression changes for Python 3.7 happened (the 2.11.1 version, IIRC, was the last one released with Python 3.5).

Thu, May 19, 7:51 AM · Patch-For-Review, Machine-Learning-Team (Active Tasks), ORES

Wed, May 18

elukey added a comment to T296173: Load test the Lift Wing cluster.

After a chat with the team, we decided to keep the tornado workers setting to 1 (default), and try the auto-scaling features offered by Knative (min/max replicas etc..).

Wed, May 18, 3:49 PM · Lift-Wing, Machine-Learning-Team (Active Tasks)
elukey added a comment to T288470: Replace cassandra-ca-manager with cergen .

The tricky bit is making sure that clients support the Root PKI CA, but I agree that it would be a great improvement for Cassandra!

Wed, May 18, 2:55 PM · Platform Team Workboards (Platform Engineering Reliability), Cassandra
elukey closed T308418: Add missing failure domain labels to ml-serve-* clusters, a subtask of T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches, as Resolved.
Wed, May 18, 2:09 PM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops
elukey closed T308418: Add missing failure domain labels to ml-serve-* clusters as Resolved.
Wed, May 18, 2:09 PM · Machine-Learning-Team (Active Tasks)
elukey added a comment to T302851: revscoring feature extraction error for wikitext papes in Wikidata .

2.11.3 is live, I have also created https://wikitech.wikimedia.org/wiki/ORES/Deployment#Update_revscoring_in_PyPI.

Wed, May 18, 8:28 AM · Patch-For-Review, Machine-Learning-Team (Active Tasks), ORES
elukey added a comment to T308418: Add missing failure domain labels to ml-serve-* clusters.
root@deploy1002:~# kubectl label nodes ml-serve-ctrl2001.codfw.wmnet node-role.kubernetes.io/master=""
node/ml-serve-ctrl2001.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve-ctrl2002.codfw.wmnet node-role.kubernetes.io/master=""
node/ml-serve-ctrl2002.codfw.wmnet labeled
Wed, May 18, 7:35 AM · Machine-Learning-Team (Active Tasks)
elukey added a comment to T308418: Add missing failure domain labels to ml-serve-* clusters.
root@deploy1002:~# kubectl label nodes ml-serve2001.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2001.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2002.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2002.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2003.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2003.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2004.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2004.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2005.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2005.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2006.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2006.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2007.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2007.codfw.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve2008.codfw.wmnet failure-domain.beta.kubernetes.io/region=codfw
node/ml-serve2008.codfw.wmnet labeled
Wed, May 18, 7:23 AM · Machine-Learning-Team (Active Tasks)
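The per-node commands above all follow one pattern, so they could be generated in a loop instead of typed one by one. This is only a minimal sketch: it assumes the node names run from ml-serve2001 to ml-serve2008 as in the transcript, and the `label_cmds` helper is hypothetical. It prints the commands rather than running them, so the output can be reviewed before it is piped to a shell.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) one `kubectl label` command per node.
# Arguments: site (e.g. codfw), first node number, last node number.
label_cmds() {
    site="$1"; first="$2"; last="$3"
    n="$first"
    while [ "$n" -le "$last" ]; do
        echo "kubectl label nodes ml-serve${n}.${site}.wmnet failure-domain.beta.kubernetes.io/region=${site}"
        n=$((n + 1))
    done
}

# Dry run for the eight codfw workers shown in the transcript.
label_cmds codfw 2001 2008
```

Once the printed commands look right, the same call can be piped to `sh` on the deployment host.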
elukey added a comment to T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira.

Resetting the task to open, since I think that Kevin and Aiko should end up in the deployment group. They will not need all the sudo capabilities for MediaWiki etc., but as far as I can see the group already includes people that don't need them either. We'll probably need to segment deployment further in the future, we'll see :)

Wed, May 18, 7:09 AM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
elukey changed the status of T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira from Stalled to Open.
Wed, May 18, 7:07 AM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
elukey updated the task description for T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira.
Wed, May 18, 7:06 AM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests

Tue, May 17

elukey updated the task description for T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira.
Tue, May 17, 2:37 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
elukey renamed T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira from Add Aiko and Kevin to the deployment posix group to Requesting access to the deployment POSIX group for aikochou and kevinbazira.
Tue, May 17, 2:28 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
elukey claimed T308418: Add missing failure domain labels to ml-serve-* clusters.
Tue, May 17, 2:19 PM · Machine-Learning-Team (Active Tasks)
elukey moved T308418: Add missing failure domain labels to ml-serve-* clusters from Backlog to In Progress on the Machine-Learning-Team (Active Tasks) board.
Tue, May 17, 2:18 PM · Machine-Learning-Team (Active Tasks)
elukey moved T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira from Backlog to Blocked on the Machine-Learning-Team (Active Tasks) board.
Tue, May 17, 2:18 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
elukey moved T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira from Unorganized to Active Tasks on the Machine-Learning-Team board.
Tue, May 17, 2:11 PM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
elukey moved T308418: Add missing failure domain labels to ml-serve-* clusters from Unorganized to Active Tasks on the Machine-Learning-Team board.
Tue, May 17, 2:11 PM · Machine-Learning-Team (Active Tasks)
elukey added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

I merged two changes for the ml-serve-eqiad cluster, and now the concerns expressed in T306649#7881940 should be gone:

Tue, May 17, 1:59 PM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops

Mon, May 16

elukey added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

Added the proposed node labels to ml-serve-eqiad via T308418#7930118. At this point I'll wait to see what strategy is best to pick between GlobalNetworkSet and fake nodes, and then we'll be able to test on ml-serve.

Mon, May 16, 11:40 AM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops
elukey added a comment to T307762: Puppet broken on deploy03.

Nice catch, TIL about wmflib::resource_hosts, thanks John!

Mon, May 16, 11:12 AM · Beta-Cluster-Infrastructure
elukey added a comment to T302851: revscoring feature extraction error for wikitext papes in Wikidata .

I realized that the copy of the revscoring repository from which I published 2.11.2 may not have had the correct commit from Aiko, so I created https://github.com/wikimedia/revscoring/pull/520 to release 2.11.3 and be sure. Sorry for the trouble, I'll update the docs once done.

Mon, May 16, 11:11 AM · Patch-For-Review, Machine-Learning-Team (Active Tasks), ORES
elukey added a comment to T308418: Add missing failure domain labels to ml-serve-* clusters.
root@deploy1002:~# kubectl label nodes ml-serve1001.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1001.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1002.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1002.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1003.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1003.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1004.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1004.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1005.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1005.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1006.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1006.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1007.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1007.eqiad.wmnet labeled
root@deploy1002:~# kubectl label nodes ml-serve1008.eqiad.wmnet failure-domain.beta.kubernetes.io/region=eqiad
node/ml-serve1008.eqiad.wmnet labeled
Mon, May 16, 8:05 AM · Machine-Learning-Team (Active Tasks)
elukey created T308418: Add missing failure domain labels to ml-serve-* clusters.
Mon, May 16, 7:57 AM · Machine-Learning-Team (Active Tasks)
elukey updated subscribers of T307762: Puppet broken on deploy03.

This may be due to T303559 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/771441, @jbond do you have any idea?

Mon, May 16, 7:50 AM · Beta-Cluster-Infrastructure
elukey added a comment to T307762: Puppet broken on deploy03.

Tried to recheck, afaics wmflib::resource_hosts is called by profile::scap::dsh, but I cannot reach puppetdb03 from deploy03 (the firewall rules on puppetdb03 confirm what I am seeing, no rule to allow traffic from deploy03 to port 443 afaics).

Mon, May 16, 7:46 AM · Beta-Cluster-Infrastructure

Fri, May 13

elukey added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

If the idea is ok, I'd propose to use labels like wikimedia.org/node-location == lsw1-f3-eqiad as it was mentioned beforehand. If you don't like the idea tell me what you prefer, no strong opinions :)

I am fine with whatever for experimentation, provided they don't stick around.

Down the road, we already have

failure-domain.beta.kubernetes.io/region: eqiad
failure-domain.beta.kubernetes.io/zone: row-d

for all wikikube nodes. These aren't nicely automated yet (which actually makes it flexible right now that we are discussing this) but rather just some yaml data under hosts/kubernetes*.yaml. I don't see ml-serve having any yet, which means we can just add them right now.

Those are well known and standardized [1]. They are also deprecated and meant to be replaced by topology.kubernetes.io/zone and topology.kubernetes.io/region , tracked in T270191.
I'd much rather we ended up with the latter ones for this eventually and not something that we invented. region and zone anyway map nicely to DC and L2 equipment in both the legacy and the e/f row cases in my mind.

[1] https://kubernetes.io/docs/reference/labels-annotations-taints

Fri, May 13, 1:36 PM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops
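The move from the deprecated label keys to their topology.kubernetes.io replacements, as described in the comment above, is a straight key rename. A minimal sketch of that mapping follows; the `topology_key` helper is hypothetical and is not taken from T270191 or from any existing tooling.

```shell
#!/bin/sh
# Hypothetical helper: map a deprecated failure-domain label key to its
# topology.kubernetes.io replacement; any other key passes through unchanged.
topology_key() {
    case "$1" in
        failure-domain.beta.kubernetes.io/region) echo "topology.kubernetes.io/region" ;;
        failure-domain.beta.kubernetes.io/zone)   echo "topology.kubernetes.io/zone" ;;
        *) echo "$1" ;;
    esac
}

# prints "topology.kubernetes.io/zone"
topology_key failure-domain.beta.kubernetes.io/zone
```

Label values such as eqiad or row-d stay the same; only the key changes.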
elukey added a comment to T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users.

Another use case mentioned in T307927#7921020 is that, IIUC, the /etc/helmfile/private config files changed group ownership as well, impacting ml deployers. For example, Aiko was able to set a home-local HELM_REPOSITORY_CACHE successfully, but then helmfile diff prompted the removal of a Secret due to some private files not readable anymore.

Fri, May 13, 1:25 PM · Release-Engineering-Team (Radar), Patch-For-Review, Kubernetes, MW-on-K8s, serviceops
elukey added a comment to T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira.

I may have created this task too soon, some discussion on T305729 is still happening, let's wait before proceeding.

Fri, May 13, 9:01 AM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests
elukey reopened T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users as "Open".

Reopening since it seems that more discussion is needed :)

Fri, May 13, 9:00 AM · Release-Engineering-Team (Radar), Patch-For-Review, Kubernetes, MW-on-K8s, serviceops
elukey reopened T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users, a subtask of T302539: Deploy MediaWiki images for kubernetes from the deployment servers, as Open.
Fri, May 13, 9:00 AM · Release-Engineering-Team (Radar), serviceops, MW-on-K8s, Scap
elukey created T308308: Requesting access to the deployment POSIX group for aikochou and kevinbazira.
Fri, May 13, 8:46 AM · Machine-Learning-Team (Active Tasks), SRE, SRE-Access-Requests

Thu, May 12

elukey added a comment to T306649: Agree strategy for Kubernetes BGP peering to top-of-rack switches.

Quick question about how to proceed. Would it make sense to start testing adding manual labels in the ml-serve-eqiad cluster (since we have new E/F nodes there) to see if everything works as expected etc..? After this verification we could start thinking about how/where to get the node label info, and how to better share/maintain the calico bgp configs etc..

Thu, May 12, 7:27 AM · Prod-Kubernetes, SRE, Infrastructure-Foundations, netops

Wed, May 11

elukey added a comment to T307927: Unable to run helmfile and check pods.

We just found that the above workaround may cause an issue: the swift-s3-credentials Secret resource got removed for some reason:

aikochou@deploy1002:/srv/deployment-charts/helmfile.d/ml-services/revscoring-articlequality$ helmfile -e ml-serve-codfw diff                
skipping missing values file matching "/etc/helmfile-defaults/private/ml-serve_services/revscoring-articlequality/ml-serve-codfw.yaml"      
skipping missing values file matching "/etc/helmfile-defaults/private/ml-serve_services/revscoring-articlequality/ml-serve-codfw.yaml"
Wed, May 11, 2:07 PM · Machine-Learning-Team (Active Tasks), Lift-Wing
elukey added a comment to T303801: Upgrade ORES to Debian Buster.

https://netbox.wikimedia.org/api/extras/job-results/3032166/ worked, so maybe some race condition with timing.

Wed, May 11, 8:52 AM · Machine-Learning-Team (Active Tasks), Patch-For-Review
elukey added a comment to T305729: Kubernetes credentials on deployment servers should be available to deployers, not all users.

I confirm from Aiko's tests that HELM_CACHE_HOME is the problem, so we can try to set it differently for various groups. From what I can see in puppet, it could be as simple as:

Wed, May 11, 8:45 AM · Release-Engineering-Team (Radar), Patch-For-Review, Kubernetes, MW-on-K8s, serviceops
elukey added a comment to T308102: Delete Cloud VPS projects ores and ores-staging.

Ah snap I thought it was deleted, yes please all can be cleaned up!

Wed, May 11, 7:53 AM · Cloud-VPS (Project-requests), cloud-services-team (Kanban)