
elukey (Luca Toscano)
Site Reliability Engineer - Analytics/Data engineering

User Details

User Since
Jan 5 2016, 9:54 PM (375 w, 6 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
LToscano (WMF) [ Global Accounts ]

Recent Activity

Today

elukey added a comment to T310980: Allow Cassandra to be deployed on Bullseye nodes.

@MoritzMuehlenhoff given how simple this use case is, I'd avoid tracking the whole Cassandra upstream branch in the new repo, and just have one main branch with the Debian config and the .py files copied over into the right places. Does that make sense, or do you prefer something more elaborate?

Tue, Mar 21, 8:44 AM · Cassandra, SRE
elukey added a comment to T310980: Allow Cassandra to be deployed on Bullseye nodes.

Requested the creation of operations/debs/cqlsh4 in https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests

Tue, Mar 21, 8:26 AM · Cassandra, SRE
elukey added a comment to T332013: Migrate kafka-main to bullseye.

Next steps:

Tue, Mar 21, 8:02 AM · serviceops
elukey moved T332392: Update revert-risk multilingual model from Unsorted to In Progress on the Machine-Learning-Team board.
Tue, Mar 21, 7:40 AM · Machine-Learning-Team, Lift-Wing

Yesterday

elukey moved T325763: Review ORES traffic to better understand Lift Wing's requirements from In Progress to Done on the Machine-Learning-Team board.
Mon, Mar 20, 2:43 PM · Machine-Learning-Team
elukey created T332602: Investigate if/how to enable the swagger UI for InferenceService resources.
Mon, Mar 20, 2:40 PM · Machine-Learning-Team
elukey added a comment to T332013: Migrate kafka-main to bullseye.

Had a chat with Joe: the idea is to reimage one node (so that we can confirm that everything works) while leaving the rest of the cluster(s) untouched. I don't think that moving to PKI is doable, since there are still clients using only the puppet CA bundle, so scratch my proposal above.

Mon, Mar 20, 2:21 PM · serviceops
elukey added a comment to T325759: Add documentation about LiftWing to the API Portal.

Added docs pages!! To avoid mixing up and complicating them, I decided to:

  • Have a dedicated page for each revscoring model, even if they share a lot.
  • Have a dedicated page for the Article Topic Outlink model (different from the above ones).
Mon, Mar 20, 11:22 AM · API-Portal, Machine-Learning-Team

Fri, Mar 17

elukey added a comment to T310980: Allow Cassandra to be deployed on Bullseye nodes.

okok this is the part that I was unclear about - we'd just deploy cqlsh in another way, e.g. via puppet, and leverage the /usr/local precedence, right? If so this could be something to hack on next week :)

Fri, Mar 17, 3:41 PM · Cassandra, SRE
elukey added a comment to T332013: Migrate kafka-main to bullseye.

Moreover it would be really great to couple this task with T319372, if possible, so that every new reimage will start from PKI directly.

Fri, Mar 17, 3:01 PM · serviceops
elukey updated subscribers of T332013: Migrate kafka-main to bullseye.

These hosts are delicate, they run the MediaWiki job queues :) We can take down a node, but it is very important to preserve the /srv partition so that Kafka doesn't have to fetch all the data back from the other brokers.

Fri, Mar 17, 2:52 PM · serviceops
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

Opened https://github.com/benthosdev/benthos/issues/1806 to upstream.

Fri, Mar 17, 11:32 AM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T310980: Allow Cassandra to be deployed on Bullseye nodes.

Getting back to this after a while, since we now need to move to Bullseye. The last blocker is cqlsh running on py2 only, so what if we keep our version of Cassandra but we upgrade its pylib only?

Fri, Mar 17, 10:30 AM · Cassandra, SRE
elukey added a comment to T313814: Upgrade to Cassandra 4.x.

ping again @Eevans :)

Fri, Mar 17, 10:22 AM · Cassandra
elukey moved T269171: Create documentation for a workflow for evaluating models submitted for deployment. from Backlog/Ethical ML to Backlog Q4 on the Machine-Learning-Team board.
Fri, Mar 17, 10:14 AM · Machine-Learning-Team
elukey added a comment to T308164: Migrate Content Translation Recommendation API to Lift Wing.

@calbon @kevinbazira should we keep this task open? If so, what are the next steps and/or subtasks?

Fri, Mar 17, 10:02 AM · Language-Team, Machine-Learning-Team, Epic
elukey closed T299664: ORES deployment repos not mirroring regular git changes anymore as Declined.

We are moving away from git to store models with Lift Wing (in favor of Swift).

Fri, Mar 17, 9:55 AM · Machine-Learning-Team, ORES
elukey moved T305447: Automate the procedure to bootstrap minikube on the ML-Sandbox and to share it by multiple users from Backlog/SRE to Backlog Q4 on the Machine-Learning-Team board.
Fri, Mar 17, 9:54 AM · Machine-Learning-Team
elukey moved T275896: Review ROCm deployment procedures and current packages from Backlog/SRE to Backlog Q4 on the Machine-Learning-Team board.
Fri, Mar 17, 9:54 AM · Data-Engineering-Icebox, Analytics-Radar, Machine-Learning-Team
elukey closed T281713: Review pre-cached wikis for ORES as Declined.

We are deprecating ORES as part of T312518.

Fri, Mar 17, 9:53 AM · Machine-Learning-Team, ORES
elukey moved T295661: Upgrade ROCm to 4.5 from Backlog/SRE to Backlog Q4 on the Machine-Learning-Team board.
Fri, Mar 17, 9:53 AM · Data-Engineering-Icebox, Analytics-Radar, Patch-For-Review, Machine-Learning-Team
elukey added a comment to T295661: Upgrade ROCm to 4.5.

Upstream already reached 5.x, we should probably upgrade to a more recent version as well to keep up and have better support (especially if we want to support more up-to-date GPUs).

Fri, Mar 17, 9:52 AM · Data-Engineering-Icebox, Analytics-Radar, Patch-For-Review, Machine-Learning-Team
elukey moved T278083: Define SLIs/SLOs for link recommendation service from Backlog/SRE to Unsorted on the Machine-Learning-Team board.
Fri, Mar 17, 9:50 AM · Growth-Team, Machine-Learning-Team, Add-Link
elukey closed T324467: Add monitoring+alerting for NLLB200 AWS service as Declined.

Closing this task since we are hopefully moving off AWS for this model. We can re-open the task in case it will be needed in the future.

Fri, Mar 17, 9:49 AM · Wikimedia Enterprise, Machine-Learning-Team, ContentTranslation
elukey closed T324467: Add monitoring+alerting for NLLB200 AWS service, a subtask of T321781: Run NLLB-200 model in a new instance, as Declined.
Fri, Mar 17, 9:49 AM · Wikimedia Enterprise, Machine-Learning-Team, ContentTranslation
elukey closed T324468: Write/polish documentation for NLLb200 on AWS as Declined.

Closing this task since we are hopefully moving off AWS for this model. We can re-open the task in case it will be needed in the future.

Fri, Mar 17, 9:48 AM · Wikimedia Enterprise, Machine-Learning-Team
elukey closed T324468: Write/polish documentation for NLLb200 on AWS, a subtask of T321781: Run NLLB-200 model in a new instance, as Declined.
Fri, Mar 17, 9:48 AM · Wikimedia Enterprise, Machine-Learning-Team, ContentTranslation
elukey moved T250110: New Service Request 'open_nsfw' from In Progress to Unsorted on the Machine-Learning-Team board.
Fri, Mar 17, 9:47 AM · serviceops-radar, Machine-Learning-Team, artificial-intelligence, SRE
elukey moved T325577: Add language support for Esperanto (eo) from In Progress to Backlog/Revscoring on the Machine-Learning-Team board.
Fri, Mar 17, 9:46 AM · artificial-intelligence, Bad-Words-Detection-System, Machine-Learning-Team, revscoring
elukey moved T330346: Detection and flagging of articles that are AI/LLM-generated from In Progress to Backlog WikiGPT on the Machine-Learning-Team board.
Fri, Mar 17, 9:46 AM · Machine-Learning-Team, Growth-Team, PageTriage
elukey added a comment to T250110: New Service Request 'open_nsfw'.

This task needs a bit more clarification: we already have an experimental model server for nsfw content. Putting it back in "Unsorted" status so the ML team can re-assess the work to be done.

Fri, Mar 17, 9:46 AM · serviceops-radar, Machine-Learning-Team, artificial-intelligence, SRE
elukey moved T325483: Add language support for Serbo-Croatian from In Progress to Backlog/Revscoring on the Machine-Learning-Team board.
Fri, Mar 17, 9:46 AM · artificial-intelligence, Machine-Learning-Team, Bad-Words-Detection-System, revscoring
elukey placed T250110: New Service Request 'open_nsfw' up for grabs.
Fri, Mar 17, 9:44 AM · serviceops-radar, Machine-Learning-Team, artificial-intelligence, SRE
elukey moved T325316: Productionize section alignment model training from In Progress to Backlog Q4 on the Machine-Learning-Team board.
Fri, Mar 17, 9:44 AM · Section-Level-Image-Suggestions, Machine-Learning-Team, Research-Backlog, Structured-Data-Backlog
elukey placed T325316: Productionize section alignment model training up for grabs.
Fri, Mar 17, 9:43 AM · Section-Level-Image-Suggestions, Machine-Learning-Team, Research-Backlog, Structured-Data-Backlog
elukey moved T312776: Add language support for Cantonese (yue) from In Progress to Backlog/Revscoring on the Machine-Learning-Team board.
Fri, Mar 17, 9:42 AM · Machine-Learning-Team, artificial-intelligence, Bad-Words-Detection-System, revscoring
elukey moved T328494: WikiGPT Experiment from Blocked to Backlog WikiGPT on the Machine-Learning-Team board.
Fri, Mar 17, 9:41 AM · Epic, Machine-Learning-Team
elukey moved T329528: Fix WikiGPT copy link feature mobile view from Blocked to Backlog WikiGPT on the Machine-Learning-Team board.
Fri, Mar 17, 9:41 AM · Machine-Learning-Team
elukey moved T329016: [WikiGPT] Improve search results of WikiGPT from Blocked to Backlog WikiGPT on the Machine-Learning-Team board.
Fri, Mar 17, 9:41 AM · Machine-Learning-Team
elukey moved T324468: Write/polish documentation for NLLb200 on AWS from Blocked to Backlog/SRE on the Machine-Learning-Team board.
Fri, Mar 17, 9:40 AM · Wikimedia Enterprise, Machine-Learning-Team
elukey moved T324467: Add monitoring+alerting for NLLB200 AWS service from Blocked to Backlog/SRE on the Machine-Learning-Team board.
Fri, Mar 17, 9:40 AM · Wikimedia Enterprise, Machine-Learning-Team, ContentTranslation
elukey moved T327923: Investigate procuring and installing two GPUs on Lift Wing from Blocked to Backlog Q4 on the Machine-Learning-Team board.
Fri, Mar 17, 9:40 AM · Machine-Learning-Team
elukey placed T327923: Investigate procuring and installing two GPUs on Lift Wing up for grabs.
Fri, Mar 17, 9:40 AM · Machine-Learning-Team
elukey added a comment to T332200: Migrate ORES MediaWiki Extension to LiftWing.

This is probably a duplicate of https://phabricator.wikimedia.org/T319170, let's decide which ticket to keep open and close the other one :)

Fri, Mar 17, 9:33 AM · ORES, MediaWiki-extensions-ORES, Machine-Learning-Team
elukey added a comment to T325759: Add documentation about LiftWing to the API Portal.

Since it was asked over IRC: Lift Wing documentation can be found in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Usage, this task is to copy the same information to the API portal :)

Fri, Mar 17, 7:48 AM · API-Portal, Machine-Learning-Team

Tue, Mar 14

elukey closed T325218: Deploy revert-risk multilingual model to production as Resolved.
Tue, Mar 14, 2:59 PM · Machine-Learning-Team, Lift-Wing
elukey closed T331045: EnWiki Recent Changes Page no longer displays damaging filters as Resolved.
Tue, Mar 14, 2:59 PM · MediaWiki-extensions-ORES, Machine-Learning-Team
elukey closed T331513: Delete old ml-related docker images that are deprecated as Resolved.
Tue, Mar 14, 2:59 PM · Machine-Learning-Team
elukey closed T329032: Upgrade the inference-services repo codebase to kserve 0.10 (fastapi) as Resolved.
Tue, Mar 14, 2:59 PM · Machine-Learning-Team
elukey closed T331114: Upgrade Kserve's k8s control plane to 0.10 as Resolved.
Tue, Mar 14, 2:59 PM · Machine-Learning-Team
elukey closed T324542: Upgrade ML clusters to Kubernetes 1.23, a subtask of T307943: Update Kubernetes clusters to v1.23, as Resolved.
Tue, Mar 14, 2:59 PM · Foundational Technology Requests, Shared-Data-Infrastructure, Patch-For-Review, Kubernetes, Prod-Kubernetes, serviceops
elukey closed T324542: Upgrade ML clusters to Kubernetes 1.23 as Resolved.
Tue, Mar 14, 2:59 PM · Machine-Learning-Team
elukey closed T331547: API-Gateway: lift auth restriction for POST requests as Resolved.
Tue, Mar 14, 2:59 PM · Platform Team Workboards (Platform Engineering Reliability), Machine-Learning-Team, Platform Team Initiatives (API Gateway Roadmap), API Platform
elukey moved T328576: Implement new mediawiki.revision-score streams with Lift Wing from Blocked to In Progress on the Machine-Learning-Team board.
Tue, Mar 14, 2:48 PM · Patch-For-Review, Machine-Learning-Team
elukey moved T331547: API-Gateway: lift auth restriction for POST requests from In Progress to Done on the Machine-Learning-Team board.
Tue, Mar 14, 2:43 PM · Platform Team Workboards (Platform Engineering Reliability), Machine-Learning-Team, Platform Team Initiatives (API Gateway Roadmap), API Platform
elukey added a comment to T331968: Let the model that learns section alignments consume section topics output.

Hi! Is there anything that the ML team needs to do? (just to organize the work etc..)

Tue, Mar 14, 2:19 PM · Section-Topics, Machine-Learning-Team, Research, Structured-Data-Backlog
elukey added a comment to T325763: Review ORES traffic to better understand Lift Wing's requirements.

Checked https://grafana-rw.wikimedia.org/d/HIRrxQ6mk/ores?forceLogin&from=now-7d&orgId=1&refresh=1m&to=now-1m&var-datasource=codfw%20prometheus%2Fops&var-model=All&viewPanel=74 and came up with some basic autoscaling numbers. We'll need to refine them as we go of course.

Tue, Mar 14, 10:53 AM · Machine-Learning-Team
elukey claimed T325759: Add documentation about LiftWing to the API Portal.
Tue, Mar 14, 10:12 AM · API-Portal, Machine-Learning-Team
elukey updated the task description for T330165: eqiad row B switches upgrade.
Tue, Mar 14, 10:00 AM · Patch-For-Review, Data Pipelines, Data-Engineering-Planning, DBA, Discovery-Search (Current work), SRE, serviceops, cloud-services-team, Machine-Learning-Team, Platform Engineering, SRE Observability, Infrastructure-Foundations, serviceops-collab, Traffic
elukey updated the task description for T331882: eqiad row C switches upgrade.
Tue, Mar 14, 9:28 AM · Patch-For-Review, serviceops-radar, Discovery-Search (Current work), SRE, DBA, cloud-services-team, Traffic, Infrastructure-Foundations, Machine-Learning-Team, Data-Engineering, serviceops-collab, Platform Engineering, SRE Observability

Mon, Mar 13

elukey added a comment to T330854: Investigate tools that use ORES.

I see. This could increase API round-tripping by 4x. Is there work underway to support multi-model, multi-rev-id requests in the Lift Wing API?
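To make the round-trip concern concrete, here is a small sketch of the arithmetic. It assumes, hypothetically, that a non-batched API needs one request per (rev-id, model) pair while a batched API needs a single request; the function name and the four-model example are illustrative and not taken from the Lift Wing API.

```python
def requests_needed(num_rev_ids, num_models, batched):
    """Count the HTTP round trips needed to score every (rev_id, model) pair.

    Assumption: a batched API accepts all rev-ids and models in one call,
    while a non-batched API needs one call per pair.
    """
    if batched:
        return 1
    return num_rev_ids * num_models


# One revision scored by four models: four calls instead of one.
single = requests_needed(num_rev_ids=1, num_models=4, batched=False)
batch = requests_needed(num_rev_ids=1, num_models=4, batched=True)
print(single // batch)
```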

Mon, Mar 13, 4:38 PM · ORES, Machine-Learning-Team, Wikimedia Enterprise
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

Filippo and I tried a ton of workarounds and solutions today, but none of them really worked. In the end we removed the restriction on the first 12 partitions for each webrequest topic (we introduced a limit a while ago to reduce the bandwidth usage) and we started seeing a different behavior from Benthos:

Mon, Mar 13, 3:47 PM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T330854: Investigate tools that use ORES.

@elukey Correct. Just wanted to add that if, with Lift Wing, we can pass several rev-ids and models at once, like we do in the example above, that would be nice.

Mon, Mar 13, 3:04 PM · ORES, Machine-Learning-Team, Wikimedia Enterprise
elukey added a comment to T330854: Investigate tools that use ORES.

@prabhat thanks! So to recap, if I got it correctly, to migrate away from ORES to Lift Wing you'd need to be able to query goodfaith/damaging model servers on demand (this is already possible), but nothing more right? You don't really use the revision-score stream for anything (only the revision-create one, but that is not controlled by ML and out of the scope for the migration).

Mon, Mar 13, 2:53 PM · ORES, Machine-Learning-Team, Wikimedia Enterprise
elukey added a comment to T325759: Add documentation about LiftWing to the API Portal.

@apaskulin thanks a lot!

Mon, Mar 13, 11:05 AM · API-Portal, Machine-Learning-Team
elukey moved T331547: API-Gateway: lift auth restriction for POST requests from Unsorted to In Progress on the Machine-Learning-Team board.
Mon, Mar 13, 11:05 AM · Platform Team Workboards (Platform Engineering Reliability), Machine-Learning-Team, Platform Team Initiatives (API Gateway Roadmap), API Platform
elukey claimed T331547: API-Gateway: lift auth restriction for POST requests.
Mon, Mar 13, 11:05 AM · Platform Team Workboards (Platform Engineering Reliability), Machine-Learning-Team, Platform Team Initiatives (API Gateway Roadmap), API Platform
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

To keep archives happy - in order to be able to delete the consumer group I had to add the following:

Mon, Mar 13, 9:27 AM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

Tried to stop all the consumers on centrallog nodes, delete the consumer group and restart all. Traffic changed and dropped back to previous values, still one third of the events processed.

Mon, Mar 13, 8:49 AM · User-fgiunchedi, SRE Observability, SRE

Sun, Mar 12

elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

The weird thing is that I keep seeing zero consumers:

Sun, Mar 12, 5:00 PM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

Tried to stop both consumers (benthos systemd units) on centrallog 1002 and 2002, reset the offsets again, and start the consumers.

Sun, Mar 12, 4:58 PM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

The traffic handled by benthos is around 1/3 of the original one now (improved but not really ok). I don't see clear indications that Benthos itself is suffering, since it now runs on better hardware and its config didn't really change.

Sun, Mar 12, 11:14 AM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

Seems better now, from the consumer group's consistency point of view:

Sun, Mar 12, 10:50 AM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

elukey@kafka-jumbo1001:~$ kafka consumer-groups --describe --group benthos-webrequest-sampled-live
kafka-consumer-groups --bootstrap-server kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092,kafka-jumbo1007.eqiad.wmnet:9092,kafka-jumbo1008.eqiad.wmnet:9092,kafka-jumbo1009.eqiad.wmnet:9092 --describe --group benthos-webrequest-sampled-live
Note: This will not show information about old Zookeeper-based consumers.
Consumer group 'benthos-webrequest-sampled-live' has no active members.

Sun, Mar 12, 10:12 AM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

I would try with a consumer group offset reset:

Sun, Mar 12, 9:30 AM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

Re-added 1001 back into Kafka Jumbo's firewall allowed host list, and restarted benthos on it. The traffic volume increased a lot, but then we went back into the only-upload-data state.

Sun, Mar 12, 9:25 AM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

Something is still off: the traffic volume reported by turnilo for live vs batch webrequest data is still different (live is a lot less). Something clearly happened when centrallog1001 was firewalled on the kafka brokers; I suspect that it didn't have the time to offload its partition assignments to the consumer group and something got weird on the Kafka side.

Sun, Mar 12, 8:42 AM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

I see some text data in https://w.wiki/6Rzi, I'll recheck in a bit to see if everything is stable.

Sun, Mar 12, 7:53 AM · User-fgiunchedi, SRE Observability, SRE
elukey added a comment to T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes.

On March 9th ~ 16 UTC there was a severe drop in data ingested by Benthos:

Sun, Mar 12, 7:48 AM · User-fgiunchedi, SRE Observability, SRE

Fri, Mar 10

elukey added a comment to T313814: Upgrade to Cassandra 4.x.

@Eevans we can definitely use the ml-cache clusters to test the upgrade, they are still not used so no problem in making experiments.

Fri, Mar 10, 4:27 PM · Cassandra
elukey added a comment to T330854: Investigate tools that use ORES.

@prabhat Thanks a lot for the explanation! Have you ever checked https://stream.wikimedia.org/v2/ui/#/?streams=mediawiki.revision-score ? It is basically the same thing, but with more scores. At the moment the stream hits ORES for every revision-create event, calculating scores for multiple models (including goodfaith and damaging). We are trying to transform it into more granular streams, like:

Fri, Mar 10, 7:13 AM · ORES, Machine-Learning-Team, Wikimedia Enterprise

Thu, Mar 9

elukey moved T331045: EnWiki Recent Changes Page no longer displays damaging filters from In Progress to Done on the Machine-Learning-Team board.
Thu, Mar 9, 10:45 AM · MediaWiki-extensions-ORES, Machine-Learning-Team
elukey moved T331513: Delete old ml-related docker images that are deprecated from Unsorted to Done on the Machine-Learning-Team board.
Thu, Mar 9, 10:21 AM · Machine-Learning-Team
elukey added a comment to T331513: Delete old ml-related docker images that are deprecated.

Updated, all images that we don't use are gone :)

Thu, Mar 9, 10:20 AM · Machine-Learning-Team
elukey added a comment to T331547: API-Gateway: lift auth restriction for POST requests.

I see that the blocker should be the following in _api_gateway_ratelimit.tpl:

Thu, Mar 9, 10:18 AM · Platform Team Workboards (Platform Engineering Reliability), Machine-Learning-Team, Platform Team Initiatives (API Gateway Roadmap), API Platform
elukey claimed T331513: Delete old ml-related docker images that are deprecated.
Thu, Mar 9, 9:52 AM · Machine-Learning-Team
elukey added a comment to T331513: Delete old ml-related docker images that are deprecated.

Cleaned up from build2001 following Wikitech's docs. Let's wait for https://docker-registry.wikimedia.org/ to sync with the new changes before closing :)

Thu, Mar 9, 9:52 AM · Machine-Learning-Team
elukey added a comment to T330854: Investigate tools that use ORES.

@prabhat hi! Do you have some info about how Enterprise uses ORES? More specifically, I see two use cases in the OKAPI repo (not sure if it is the right one or not though):

Thu, Mar 9, 9:23 AM · ORES, Machine-Learning-Team, Wikimedia Enterprise
elukey committed rMLISc9aa3d06d052: revscoring: relax schema version checks when retrieving the rev-id (authored by elukey).
revscoring: relax schema version checks when retrieving the rev-id
Thu, Mar 9, 7:44 AM

Wed, Mar 8

elukey added a comment to T331416: The nsfw model hangs in predict() after moving to Kserve 0.10.

@Htriedman thanks a lot! If you want to test the docker image: https://docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-nsfw/tags/ (the last tag should contain the issue).

Wed, Mar 8, 6:52 PM · Machine-Learning-Team
elukey added a comment to T329071: Integration of Revert Risk Scores to Recent Changes as a filter.

The problem that I see with 1) is that we are already filtering (and rightfully so) a lot of events, meanwhile researchers may want the whole stream scored.

The thing is, it's not that it doesn't do it for jobqueue reasons or that it can't. It doesn't do it for storage reasons. It's pretty easy to make the ores extension simply queue the job and call ores/liftwing for every edit, but just not store the result if it doesn't need to. You can then make it emit an event for every edit for free. The change is quite minimal.

Wed, Mar 8, 5:27 PM · Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team, Edit-Review-Improvements-Integrated-Filters, Research, Growth-Team
elukey added a project to T331547: API-Gateway: lift auth restriction for POST requests: Machine-Learning-Team.
Wed, Mar 8, 4:37 PM · Platform Team Workboards (Platform Engineering Reliability), Machine-Learning-Team, Platform Team Initiatives (API Gateway Roadmap), API Platform
elukey created T331547: API-Gateway: lift auth restriction for POST requests.
Wed, Mar 8, 4:37 PM · Platform Team Workboards (Platform Engineering Reliability), Machine-Learning-Team, Platform Team Initiatives (API Gateway Roadmap), API Platform
elukey added a comment to T329071: Integration of Revert Risk Scores to Recent Changes as a filter.

Thanks a lot!

Regarding the jobs, the reason ores ext doesn't trigger a job is not that it can't, it's because it could 1) overwhelm the ores service and 2) fill the mw mysql tables with crap. The biggest example is the 22M edits done monthly in Wikidata, of which only a very small fraction is valuable for ores ext (edits that are not auto-patrolled by mediawiki are needed for patrollers), so the extension simply ignores edits done by auto-patrolled users (including bots), which filters out 99.9% of edits.
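The filtering rule described above can be sketched as a simple predicate. This is only an illustration: the `autopatrolled` and `bot` field names are hypothetical, and the real extension is written in PHP with its own data model.

```python
def needs_scoring(edit):
    """Return True only for edits that patrollers still need to review.

    Assumption: `edit` is a dict with optional boolean flags
    `autopatrolled` and `bot` (hypothetical field names).
    """
    return not (edit.get("autopatrolled", False) or edit.get("bot", False))


edits = [
    {"rev_id": 1, "autopatrolled": True},
    {"rev_id": 2, "bot": True},
    {"rev_id": 3},
]
# Only the edit that is neither auto-patrolled nor made by a bot is kept.
to_score = [e["rev_id"] for e in edits if needs_scoring(e)]
```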

Just to understand - the extension does trigger async jobs in the job queue right? IIUC calling ORES and inserting in the DB is not done at edit time, but later on (forgive my ignorance but I'd like to be sure about these things, Mediawiki is not my area of expertise :)

Yes, it's post edit. One of the pillars of mediawiki is to save the edit as soon as possible and build the canonical entry, and then trigger a massive set of secondary data updates (via deferred updates or jobs) to do a wide range of updates, from CDN purges, to ores, to updating the search index, etc. This is called the "outbox pattern" in the industry. MediaWiki is basically event-driven, but not in an obvious way.
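As a rough illustration of the save-first, update-later flow described above, here is a minimal in-memory sketch. All names (`save_edit`, `deferred_updates`, the update labels) are hypothetical and do not correspond to MediaWiki's actual PHP APIs; a real outbox would also persist the queued work transactionally, which this sketch does not attempt.

```python
from collections import deque

# Hypothetical in-memory stand-ins for the database and the job queue.
revisions = []              # canonical store of saved edits
deferred_updates = deque()  # secondary work queued at save time


def save_edit(page, text):
    """Persist the edit first, then queue secondary updates for later."""
    rev_id = len(revisions) + 1
    revisions.append({"rev_id": rev_id, "page": page, "text": text})
    # None of these run at edit time; they are only queued here.
    for name in ("cdn_purge", "ores_score", "search_index"):
        deferred_updates.append((name, rev_id))
    return rev_id


def run_deferred_updates():
    """Drain the queue after the edit is saved; return what was processed."""
    done = []
    while deferred_updates:
        done.append(deferred_updates.popleft())
    return done
```

The point of the split is that `save_edit` returns as soon as the canonical entry exists, while scoring, purging, and indexing happen afterwards.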

Wed, Mar 8, 4:23 PM · Data-Engineering-Planning, Event-Platform Value Stream, Machine-Learning-Team, Edit-Review-Improvements-Integrated-Filters, Research, Growth-Team
elukey moved T329032: Upgrade the inference-services repo codebase to kserve 0.10 (fastapi) from Backlog/Lift Wing to Done on the Machine-Learning-Team board.
Wed, Mar 8, 2:58 PM · Machine-Learning-Team
elukey claimed T329032: Upgrade the inference-services repo codebase to kserve 0.10 (fastapi).
Wed, Mar 8, 2:57 PM · Machine-Learning-Team
elukey added a comment to T329032: Upgrade the inference-services repo codebase to kserve 0.10 (fastapi).

Task completed, all clusters upgraded to kserve 0.10. The nsfw model doesn't work, but since it is experimental we'll follow up in T331416.

Wed, Mar 8, 2:57 PM · Machine-Learning-Team
elukey moved T331114: Upgrade Kserve's k8s control plane to 0.10 from In Progress to Done on the Machine-Learning-Team board.
Wed, Mar 8, 2:56 PM · Machine-Learning-Team
elukey added a comment to T331114: Upgrade Kserve's k8s control plane to 0.10.

New control plane deployed on all clusters and tested!

Wed, Mar 8, 2:56 PM · Machine-Learning-Team
elukey updated subscribers of T331513: Delete old ml-related docker images that are deprecated.

Candidates for deletion:

Wed, Mar 8, 9:45 AM · Machine-Learning-Team