kevinbazira (Kevin Bazira, KBazira)
Software Engineer (Machine Learning)

User Details

User Since
Aug 3 2019, 6:58 AM (225 w, 4 d)
Availability
Available
IRC Nick
kevinbazira
LDAP User
Kevin Bazira
MediaWiki User
KBazira (WMF)

Recent Activity

Today

kevinbazira committed rMLISa8006485f3c6: article-descriptions: remove host header from rest-gateway endpoint (authored by kevinbazira).
Wed, Nov 29, 11:40 AM
kevinbazira committed rMLISbbe397fb4d34: article-descriptions: fix wikipedia api summary endpoint (authored by kevinbazira).
Wed, Nov 29, 10:07 AM

Yesterday

kevinbazira added a comment to T348156: Goal: Increase the number of models hosted on Lift Wing.

Working on migrating the machine-generated article descriptions model from toolforge to LiftWing:

  • added article-descriptions model-server to LiftWing inference services repo
  • added CI pipeline jobs to test and publish the article-descriptions model-server image to the Wikimedia docker registry
  • uploaded article-descriptions model files to swift in mbart-large-cc25 and bert-base-multilingual-uncased paths.
  • added the article-descriptions inference service to the experimental namespace on LiftWing
  • fixed the model-server to pass the local_files_only parameter, so the pretrained pytorch tokenizer is instantiated from local files without downloading anything from huggingface.co.
  • in T351940#9359437 fixed the AsyncSession host header issue experienced in T351940#9358303.
  • currently working on fixing the wikipedia api summary endpoint as we have to use a k8s internal endpoint to access it.
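
The local_files_only fix in the list above can be sketched as follows. This is a minimal sketch, assuming the Hugging Face transformers library and its AutoTokenizer class; the helper name load_local_tokenizer and the directory layout are hypothetical and are not taken from the model-server code.

```python
from pathlib import Path


def load_local_tokenizer(model_dir: str):
    """Load a pretrained tokenizer strictly from files on disk.

    Hypothetical helper: the real model-server may use a different
    tokenizer class or directory layout.
    """
    path = Path(model_dir)
    if not path.is_dir():
        # Fail early with a clear message instead of letting the library
        # attempt a lookup on huggingface.co.
        raise FileNotFoundError(f"model directory not found: {path}")

    # Imported lazily so the directory check works without the dependency.
    from transformers import AutoTokenizer

    # local_files_only=True tells transformers not to contact the Hub.
    return AutoTokenizer.from_pretrained(str(path), local_files_only=True)
```

Checking the directory first keeps the failure mode obvious when the model files were not copied into the container.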
Tue, Nov 28, 3:41 PM · Goal, Machine-Learning-Team

Mon, Nov 27

kevinbazira committed rMLISc46ed2a8e026: article-descriptions: fix AsyncSession host header (authored by kevinbazira).
Mon, Nov 27, 4:17 PM
kevinbazira added a comment to T351940: Enable local runs for article-descriptions model.

I discovered what was causing this issue and pushed a patch for it here.

Mon, Nov 27, 3:03 PM · Patch-For-Review, Machine-Learning-Team

Fri, Nov 24

kevinbazira committed rMLIS3a31216609d0: article-descriptions: update wiki host headers in model-server (authored by kevinbazira).
Fri, Nov 24, 3:15 PM

Thu, Nov 23

kevinbazira committed rMLIS18327eecd382: article-descriptions: update model-server to use local files only (authored by kevinbazira).
Thu, Nov 23, 2:53 PM

Mon, Nov 20

kevinbazira committed rMLIS0ae08690aa2b: article-descriptions: add article-descriptions model server (authored by kevinbazira).
Mon, Nov 20, 3:38 PM

Wed, Nov 15

kevinbazira added a comment to T340944: The published dataset's list of wikis misses a couple of wikis with existing data.

@Sgs, ttermwiki has been removed from the published datasets; it was created to test the unpublish-datasets script in T344799.

Wed, Nov 15, 9:10 AM · Growth-Team (Sprint 3 (Growth Team)), Add-Link

Tue, Nov 14

kevinbazira added a comment to T348156: Goal: Increase the number of models hosted on Lift Wing.

1. Isaac from the research team tested the deployed rec-api and shared 2 edge cases:

  • the rec-api wasn't returning results for anything besides the 'spec' param; we investigated this in T347475, noticed envoy proxy constraints, and fixed them in T348607.
  • the rec-api was returning empty results when a query was made without the 'seed' param; we discovered that the pageviews envoy settings weren't correct and fixed them. We later updated them based on the wp-analytics team notice in T348607#9283681.

2. Started working on migrating the machine-generated article descriptions model from toolforge to LiftWing

Tue, Nov 14, 4:01 PM · Goal, Machine-Learning-Team

Tue, Nov 7

kevinbazira added a comment to T343123: Migrate Machine-generated Article Descriptions from toolforge to liftwing..

@Isaac thank you for sharing the new codebase that has fewer dependencies. We are adapting the article-descriptions kserve model-server to use this codebase.

Tue, Nov 7, 5:42 AM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Machine-Learning-Team

Fri, Nov 3

kevinbazira updated the task description for T347262: Set SLO for the recommendation-api-ng service hosted on LiftWing.
Fri, Nov 3, 5:12 AM · Machine-Learning-Team

Thu, Nov 2

kevinbazira added a comment to T343123: Migrate Machine-generated Article Descriptions from toolforge to liftwing..

Hi @Isaac and @Seddon, while working on the migration of the article-descriptions model from Toolforge to LiftWing, we noticed that the GitHub repository, which this project relies on (including the Toolforge instance and some of the LiftWing dependencies), is not owned by WMF but is owned by an individual. In case the owner decides to delete it, the LiftWing model-server won't have the necessary dependencies. Do you plan to move this repository to WMF's Gerrit or GitLab?

Thu, Nov 2, 7:14 AM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Machine-Learning-Team

Oct 26 2023

kevinbazira added a comment to T348607: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing.

Folks from wikimedia-analytics notified us that they are migrating the pageviews endpoint as shown in the screenshot below:

[Screenshot: notice of the pageviews endpoint migration from restbase to rest-gateway]

We have updated the rec-api-ng envoy settings for the pageviews endpoint from:

uri: http://localhost:6020/analytics.wikimedia.org/v1/pageview
listener: aqs

to

uri: http://localhost:6033/wikimedia.org/v1/metrics/pageviews
listener: rest-gateway
Oct 26 2023, 11:17 AM · Machine-Learning-Team

Oct 24 2023

kevinbazira added a comment to T347475: Investigate recommendation-api-ng internal endpoint failure.

Thank you for sharing the notes, @Isaac. We were able to reproduce this issue in T348607#9275075 and fixed it in T348607#9275348. The rec-api-ng now returns results whether or not the seed parameter is specified as shown below:

Oct 24 2023, 10:24 AM · Machine-Learning-Team
kevinbazira added a comment to T348607: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing.

Since we couldn't experiment and check the correct pageviews uri from k8s, an SRE logged into one of the AQS nodes and tested the uri with curl until we got the correct one: http://localhost:6020/analytics.wikimedia.org/v1/pageviews. We've added it to the rec-api-ng, deployed it on staging, tested that it works, and then deployed it to prod.

Oct 24 2023, 10:02 AM · Machine-Learning-Team
kevinbazira added a comment to T348607: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing.

As reported in T347475#9269007, the pageviews endpoint is failing. I reproduced this issue on staging and here are the logs showing that the endpoint fails because the URI:

2023-10-24 05:09:12,783 recommendation.utils.event_logger log_api_request():37 INFO -- Logging event: {"schema": "TranslationRecommendationAPIRequests", "$schema": "/analytics/legacy/$translationrecommendationapirequests/1.0.0", "revision": 16261139, "event": {"timestamp": 1698124152, "sourceLanguage": "en", "targetLanguage": "es", "searchAlgorithm": "related_articles"}, "webHost": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443", "client_dt": "2023-10-24T05:09:12.783157", "meta": {"stream": "eventlogging_TranslationRecommendationAPIRequests", "domain": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443"}}
2023-10-24 05:09:12,801 recommendation.api.external_data.fetcher get():26 INFO -- Request failed: {"url": "http://localhost:6020/analytics.wikimedia.org/v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/22", "error": "404 Client Error: Not Found for url: http://localhost:6020/analytics.wikimedia.org/v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/22"}
2023-10-24 05:09:12,801 recommendation.api.external_data.fetcher get_most_popular_articles():143 INFO -- pageview query failed
2023-10-24 05:09:12,847 recommendation.api.types.translation.translation process_request():195 INFO -- Request processed in 0.064753 seconds
[pid: 135|app: 0|req: 7360/10465] 127.0.0.1 () {44 vars in 2182 bytes} [Tue Oct 24 05:09:12 2023] GET /types/translation/v1/articles?source=en&target=es&seed=&search=related_articles&application=CX => generated 3 bytes in 65 msecs (HTTP/1.1 200) 5 headers in 195 bytes (1 switches on core 0)
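
The failing request in the logs above carries a JSON payload after the "Request failed: " marker, which makes it easy to pull out the offending URL. A minimal sketch, assuming that marker format stays as shown; the helper name parse_failed_request is hypothetical.

```python
import json

MARKER = "Request failed: "


def parse_failed_request(line: str):
    """Return the JSON payload of a 'Request failed' log line, or None."""
    _, found, payload = line.partition(MARKER)
    if not found:
        return None
    return json.loads(payload)


# Sample line copied from the staging logs above.
sample = (
    "2023-10-24 05:09:12,801 recommendation.api.external_data.fetcher get():26 "
    'INFO -- Request failed: {"url": "http://localhost:6020/analytics.wikimedia.org'
    '/v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/22", '
    '"error": "404 Client Error: Not Found for url: http://localhost:6020/'
    'analytics.wikimedia.org/v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/22"}'
)
info = parse_failed_request(sample)
print(info["url"])
print(info["error"].split(":")[0])
```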
Oct 24 2023, 8:32 AM · Machine-Learning-Team

Oct 23 2023

kevinbazira added a comment to T348607: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing.

To help others who might have trouble resolving the issue above in the future, I have added a note to the envoy proxy docs showing that the same host can use different envoy listeners depending on the endpoint being accessed:
https://wikitech.wikimedia.org/w/index.php?title=Envoy&oldid=2121603#Example_(calling_mw-api)

Oct 23 2023, 7:33 AM · Machine-Learning-Team

Oct 20 2023

kevinbazira added a comment to T347475: Investigate recommendation-api-ng internal endpoint failure.

@Isaac, the rec-api-ng endpoint now works as shown below. Please let us know whether there are edge cases we might have missed:

$ time curl "https://recommendation-api-ng.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Basketball"
[{"pageviews": 0, "title": "2019_United_States_FIBA_Basketball_World_Cup_team", "wikidata_id": "Q56042822", "rank": 495.0}, {"pageviews": 0, "title": "Molly_Bolin", "wikidata_id": "Q27451749", "rank": 482.0}, {"pageviews": 0, "title": "Dom_Flora", "wikidata_id": "Q5289728", "rank": 478.0}]
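
For readers who prefer Python over curl, the same request URL can be built and the response ordered by rank as sketched below. This is an illustrative sketch only: the helper names are hypothetical, and the endpoint is reachable only from inside the Wikimedia network, so the example does not perform the HTTP call.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://recommendation-api-ng.discovery.wmnet:31443/api/"


def build_recommendation_url(source, target, count, seed=None):
    """Build the query URL used in the curl example above."""
    params = {"s": source, "t": target, "n": count}
    if seed:
        params["article"] = seed
    return f"{BASE_URL}?{urlencode(params)}"


def titles_by_rank(response_body: str):
    """Return article titles from a response body, highest rank first."""
    articles = json.loads(response_body)
    ordered = sorted(articles, key=lambda a: a["rank"], reverse=True)
    return [a["title"] for a in ordered]


url = build_recommendation_url("en", "fr", 3, seed="Basketball")
print(url)
```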
Oct 20 2023, 11:46 AM · Machine-Learning-Team
kevinbazira added a comment to T347475: Investigate recommendation-api-ng internal endpoint failure.

In T348607#9267956 we fixed the envoy listener for the pageviews endpoint. Now the rec-api-ng is able to access all external endpoints from k8s/LiftWing.

Oct 20 2023, 11:43 AM · Machine-Learning-Team
kevinbazira added a comment to T348607: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing.

We discovered that the pageviews endpoint does not use the mw-api-int-async-ro listener but rather the aqs listener. The full settings that enabled the rec-api-ng hosted on LiftWing to access endpoints external to k8s/LiftWing through the envoy proxy are:

| endpoint name | endpoint host header | endpoint uri | envoy listener |
| --- | --- | --- | --- |
| language_pairs | cxserver.wikimedia.org | http://localhost:6015/v1/languagepairs | cxserver |
| pageviews | wikimedia.org | http://localhost:6033/wikimedia.org/v1/metrics/pageviews | rest-gateway |
| wikipedia | {source}.wikipedia.org | http://localhost:6500/w/api.php | mw-api-int-async-ro |
| wikidata | www.wikidata.org | http://localhost:6500/w/api.php | mw-api-int-async-ro |
| event_logger | intake-analytics.wikimedia.org | http://localhost:6004/v1/events?hasty=true | eventgate-analytics |
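
The settings in the table above boil down to "send the request to a localhost listener and set the Host header to the real service". A minimal sketch of that lookup follows; the values are copied from the table, while the dict layout and the helper name envoy_request_args are assumptions made for illustration and do not reflect how rec-api-ng stores its configuration.

```python
# endpoint name -> (host header template, local envoy uri)
ENVOY_ENDPOINTS = {
    "language_pairs": ("cxserver.wikimedia.org", "http://localhost:6015/v1/languagepairs"),
    "pageviews": ("wikimedia.org", "http://localhost:6033/wikimedia.org/v1/metrics/pageviews"),
    "wikipedia": ("{source}.wikipedia.org", "http://localhost:6500/w/api.php"),
    "wikidata": ("www.wikidata.org", "http://localhost:6500/w/api.php"),
    "event_logger": ("intake-analytics.wikimedia.org", "http://localhost:6004/v1/events?hasty=true"),
}


def envoy_request_args(endpoint: str, source: str = ""):
    """Return (uri, headers) for a request routed through the local envoy proxy."""
    host_template, uri = ENVOY_ENDPOINTS[endpoint]
    if "{source}" in host_template and not source:
        raise ValueError(f"endpoint {endpoint!r} needs a source language")
    # The Host header names the real service; the uri points at the listener port.
    return uri, {"Host": host_template.format(source=source)}


uri, headers = envoy_request_args("wikipedia", source="en")
print(uri, headers)
```

The returned pair can be passed to any HTTP client that accepts a URL and a headers mapping.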
Oct 20 2023, 11:20 AM · Machine-Learning-Team

Oct 19 2023

kevinbazira added a comment to T347475: Investigate recommendation-api-ng internal endpoint failure.

We resolved the envoy proxy issues as shown in T348607#9264192. However, the rec-api instance hosted on LiftWing is still not performing as expected. The container logs on staging show that the worker processes are being terminated when the rec-api is processing a large number of results (501) from the "morelike" external endpoint.

Wed Oct 18 10:32:52 2023 - *** HARAKIRI ON WORKER 1 (pid: 134, try: 1) ***
Wed Oct 18 10:32:52 2023 - HARAKIRI !!! worker 1 status !!!
Wed Oct 18 10:32:52 2023 - HARAKIRI [core 0] 127.0.0.1 - GET /api/?s=en&t=fr&n=3&article=Apple since 1697625156
Wed Oct 18 10:32:52 2023 - HARAKIRI !!! end of worker 1 status !!!
2023-10-18 10:32:52,799 recommendation.utils.event_logger log_api_request():39 INFO -- Logging event: {"schema": "TranslationRecommendationAPIRequests", "$schema": "/analytics/legacy/$translationrecommendationapirequests/1.0.0", "revision": 16261139, "event": {"timestamp": 1697625172, "sourceLanguage": "en", "targetLanguage": "fr", "seed": "Apple", "searchAlgorithm": "morelike"}, "webHost": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443", "client_dt": "2023-10-18T10:32:52.799807", "meta": {"stream": "eventlogging_TranslationRecommendationAPIRequests", "domain": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443"}}
2023-10-18 10:32:53,089 recommendation.api.types.translation.candidate_finders get_morelike_candidates():39 INFO -- morelike returned 501 results
DAMN ! worker 1 (pid: 134) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 183)
[pid: 183|app: 0|req: 8/35] 10.192.0.201 () {30 vars in 376 bytes} [Wed Oct 18 10:33:02 2023] GET /api/spec => generated 1908 bytes in 4 msecs (HTTP/1.1 200) 5 headers in 198 bytes (1 switches on core 0)
Wed Oct 18 10:33:08 2023 - *** HARAKIRI ON WORKER 2 (pid: 166, try: 1) ***
Wed Oct 18 10:33:08 2023 - HARAKIRI !!! worker 2 status !!!
Wed Oct 18 10:33:08 2023 - HARAKIRI [core 0] 127.0.0.1 - GET /api/?s=en&t=fr&n=3&article=Apple since 1697625172
Wed Oct 18 10:33:08 2023 - HARAKIRI !!! end of worker 2 status !!!
DAMN ! worker 2 (pid: 166) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 187)

We are going to experiment with different resource limits and monitor the rec-api-ng performance on staging to identify the optimal settings that will enable the API to return results and perform as expected.
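
While experimenting with resource limits, it helps to count how often uWSGI kills a worker. The sketch below scans log text for the "HARAKIRI ON WORKER" lines shown above; the regular expression is written against those sample lines only, and the helper name find_harakiri_events is hypothetical.

```python
import re

HARAKIRI_RE = re.compile(
    r"\*\*\* HARAKIRI ON WORKER (?P<worker>\d+) "
    r"\(pid: (?P<pid>\d+), try: (?P<attempt>\d+)\) \*\*\*"
)


def find_harakiri_events(log_text: str):
    """Return (worker, pid) tuples for every harakiri kill in the log."""
    return [
        (int(m["worker"]), int(m["pid"]))
        for m in HARAKIRI_RE.finditer(log_text)
    ]


# Lines copied from the staging container logs above.
sample_log = """\
Wed Oct 18 10:32:52 2023 - *** HARAKIRI ON WORKER 1 (pid: 134, try: 1) ***
Wed Oct 18 10:32:52 2023 - HARAKIRI !!! worker 1 status !!!
Wed Oct 18 10:33:08 2023 - *** HARAKIRI ON WORKER 2 (pid: 166, try: 1) ***
"""
print(find_harakiri_events(sample_log))  # [(1, 134), (2, 166)]
```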

Oct 19 2023, 9:09 AM · Machine-Learning-Team
kevinbazira added a comment to T348607: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing.

We configured envoy settings and the rec-api-ng container hosted on LiftWing is able to access endpoints external to k8s/LiftWing. Below are the settings we configured:

| endpoint name | endpoint host header | envoy listener | port |
| --- | --- | --- | --- |
| language_pairs | cxserver.wikimedia.org | cxserver | 6015 |
| pageviews | wikimedia.org | mw-api-int-async-ro | 6500 |
| wikipedia | {source}.wikipedia.org | mw-api-int-async-ro | 6500 |
| wikidata | www.wikidata.org | mw-api-int-async-ro | 6500 |
| event_logger | intake-analytics.wikimedia.org | eventgate-analytics | 6004 |
Oct 19 2023, 8:48 AM · Machine-Learning-Team

Oct 11 2023

kevinbazira added a comment to T347475: Investigate recommendation-api-ng internal endpoint failure.

Following T347475#9235065, it was clear that the rec-api-ng instance on LiftWing doesn't work like the rec-api container outside LiftWing, despite having similar resources allocated. We did some troubleshooting with @elukey on IRC and discovered that the container on LiftWing cannot access wikimedia, wikidata, and wikipedia endpoints. In T348607 we are configuring envoy to enable this instance to access endpoints external to k8s/LiftWing.

Oct 11 2023, 7:59 AM · Machine-Learning-Team
kevinbazira created T348607: Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing.
Oct 11 2023, 7:15 AM · Machine-Learning-Team

Oct 9 2023

kevinbazira added a comment to T347475: Investigate recommendation-api-ng internal endpoint failure.

Thanks @isarantopoulos and @Isaac for the suggestions!

Oct 9 2023, 10:18 AM · Machine-Learning-Team

Oct 5 2023

kevinbazira updated subscribers of T308144: Deploy "add a link" to 18th round of wikis (en.wp and de.wp).

@Trizek-WMF, thank you so much for sharing this script that helps to curb overlinking. I am looping in @MGerlach since he will work on improving add-a-link model performance; this might interest him.

Oct 5 2023, 4:11 PM · Machine-Learning-Team, Growth-Team, User-notice, Add-Link

Oct 4 2023

kevinbazira added a comment to T347475: Investigate recommendation-api-ng internal endpoint failure.

Thanks @elukey. Currently, these are the resources allocated to the recommendation-api pod. I am not aware of the amount of resources available on LiftWing. Is it possible to increase the resources allocated to the pod so that we can eliminate the possibility of resource constraints affecting the performance of the endpoint?

Oct 4 2023, 10:03 AM · Machine-Learning-Team

Oct 3 2023

kevinbazira added a comment to T347475: Investigate recommendation-api-ng internal endpoint failure.

Investigation Report

Without being able to fully test the recommendation-api on wmflabs and LiftWing, I ran a couple of experiments to investigate the cause of the performance disparity between the recommendation-api hosted on wmflabs and LiftWing, which is causing the internal endpoint to hang (as shown in this task's description). The steps I took and the results achieved are detailed below.

Oct 3 2023, 8:12 AM · Machine-Learning-Team
kevinbazira added a comment to T347263: Create external endpoint for recommendation-api-ng hosted on LiftWing.

Thank you for testing the internal endpoint @Isaac. We are investigating the cause of this issue in T347475 and a possible solution for it.

Oct 3 2023, 8:03 AM · Machine-Learning-Team
kevinbazira closed T336927: Completion report on training 18 rounds of add-a-link models, a subtask of T304110: [EPIC] Deploy "add a link" to all Wikipedias, as Resolved.
Oct 3 2023, 6:16 AM · CommRel-Specialists-Support, Growth-Team, Epic, Add-Link
kevinbazira closed T336927: Completion report on training 18 rounds of add-a-link models as Resolved.
Oct 3 2023, 6:16 AM · Machine-Learning-Team, Add-Link, Growth-Team
kevinbazira closed T343374: Create a single table with evaluation metrics from all trained add-a-link models, a subtask of T336927: Completion report on training 18 rounds of add-a-link models, as Resolved.
Oct 3 2023, 6:15 AM · Machine-Learning-Team, Add-Link, Growth-Team
kevinbazira closed T343374: Create a single table with evaluation metrics from all trained add-a-link models as Resolved.
Oct 3 2023, 6:15 AM · Growth-Team, Machine-Learning-Team, Add-Link

Oct 2 2023

kevinbazira added a comment to T308139: Deploy "add a link" to 14th round of wikis.

I've checked the enabled wikis and all present a fair amount of results except for:

  • xalwiki returns 5 results
  • xmfwiki returns 3 results
  • xhwiki returns 0 results

@kevinbazira do you have any clues on why the model produces few results in the mentioned wikis?

Oct 2 2023, 7:46 PM · User-notice-archive, Growth-Team (Sprint 1 (Growth Team)), Machine-Learning-Team, Chinese-Sites, Add-Link

Sep 27 2023

kevinbazira created T347475: Investigate recommendation-api-ng internal endpoint failure.
Sep 27 2023, 11:30 AM · Machine-Learning-Team

Sep 26 2023

kevinbazira added a comment to T341704: Content Recommendation API migration .

We finally got rec-api deployment settings that could run on LiftWing:

  1. a 4th test was run and this dropped the memory usage by 5x
  2. we disabled the related_articles service (as advised by the research team), the rec-api ended up running without the embedding.
  3. deployed the rec-api on staging and then finally in production
Sep 26 2023, 2:31 PM · Goal, Machine-Learning-Team
kevinbazira added a comment to T347387: Add JavaScript examples to LiftWing API gateway docs.

Also added the user-agent header in the LiftWing usage docs for all examples (curl, python, and JS):

  1. internal endpoints: https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Usage&oldid=2114576#Example_usage_of_internal_endpoint
  2. external endpoints: https://wikitech.wikimedia.org/w/index.php?title=Machine_Learning/LiftWing/Usage&oldid=2114576#Example_usage_of_external_endpoint
Sep 26 2023, 10:57 AM · Documentation, Machine-Learning-Team
kevinbazira added a comment to T347387: Add JavaScript examples to LiftWing API gateway docs.

JavaScript examples have been added to the LiftWing API gateway docs as shown below:

  1. revscoring goodfaith prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_goodfaith_prediction#Examples
  2. revscoring damaging prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_damaging_prediction#Examples
  3. revscoring reverted prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_reverted_prediction#Examples
  4. revscoring drafttopic prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_drafttopic_prediction#Examples
  5. revscoring draftquality prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_draftquality_prediction#Examples
  6. revscoring articlequality prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_articlequality_prediction#Examples
  7. revscoring articletopic prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_articletopic_prediction#Examples
  8. reverted risk multilingual prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_reverted_risk_multilingual_prediction#Examples
  9. revert risk language agnostic prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_reverted_risk_language_agnostic_prediction#Examples
  10. articletopic outlink prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_articletopic_outlink_prediction#Examples#ooui-5
Sep 26 2023, 10:57 AM · Documentation, Machine-Learning-Team
kevinbazira created T347387: Add JavaScript examples to LiftWing API gateway docs.
Sep 26 2023, 10:56 AM · Documentation, Machine-Learning-Team

Sep 25 2023

kevinbazira updated subscribers of T347263: Create external endpoint for recommendation-api-ng hosted on LiftWing.

Hi @Isaac, @santhosh, and @Seddon. The ML team was assigned the task of migrating the recommendation-api from wmflabs to LiftWing. We have successfully deployed the recommendation-api on LiftWing.

Sep 25 2023, 8:14 AM · Machine-Learning-Team
kevinbazira placed T347263: Create external endpoint for recommendation-api-ng hosted on LiftWing up for grabs.
Sep 25 2023, 7:47 AM · Machine-Learning-Team
kevinbazira created T347263: Create external endpoint for recommendation-api-ng hosted on LiftWing.
Sep 25 2023, 7:46 AM · Machine-Learning-Team
kevinbazira updated subscribers of T347262: Set SLO for the recommendation-api-ng service hosted on LiftWing.
Sep 25 2023, 7:44 AM · Machine-Learning-Team
kevinbazira created T347262: Set SLO for the recommendation-api-ng service hosted on LiftWing.
Sep 25 2023, 7:38 AM · Machine-Learning-Team
kevinbazira added a comment to T347015: Deploy the recommendation-api-ng on LiftWing.

The recommendation-api-ng has successfully been deployed to LiftWing production in both eqiad and codfw:

kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kube_env recommendation-api-ng ml-serve-eqiad
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
recommendation-api-ng-main-5c4f58c685-jts5q   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-m4mfs   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-q8vdn   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-snn99   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-tfdmt   2/2     Running   0          8m24s
...
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kube_env recommendation-api-ng ml-serve-codfw
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
recommendation-api-ng-main-5c4f58c685-dc22j   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-mqm97   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-s47l8   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-w8s2p   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-zw9tx   2/2     Running   0          7m37s
Sep 25 2023, 7:33 AM · Machine-Learning-Team

Sep 21 2023

kevinbazira added a comment to T347015: Deploy the recommendation-api-ng on LiftWing.

The recommendation-api-ng has been successfully deployed on LiftWing staging:

kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services$ curl https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/spec
{"basePath":"/api","consumes":["application/json"],"definitions":{"Article":{"properties":{"pageviews":{"description":"pageviews","type":"integer"},"rank":{"description":"rank","type":"number"},"title":{"description":"title","type":"string"},"wikidata_id":{"description":"wikidata_id","type":"string"}},"required":["rank","title","wikidata_id"],"type":"object"}},"info":{"title":"API","version":"1.0"},"paths":{"/":{"get":{"deprecated":true,"description":"Gets recommendations of source articles that are missing in the target","operationId":"get_legacy_article","parameters":[{"description":"Source wiki project language code","in":"query","name":"s","required":true,"type":"string"},{"description":"Target wiki project language code","in":"query","name":"t","required":true,"type":"string"},{"default":12,"description":"Number of recommendations to fetch","in":"query","maximum":24,"minimum":0,"name":"n","type":"integer"},{"description":"Seed article for personalized recommendations that can also be a list separated by \"|\"","in":"query","name":"article","pattern":"^([^|]+(\\|[^|]+)*)?$","type":"string"},{"default":true,"description":"Whether to include pageview counts","in":"query","name":"pageviews","type":"boolean"},{"collectionFormat":"multi","default":"morelike","description":"Which search algorithm to use if a seed is specified","enum":["morelike","wiki"],"in":"query","name":"search","type":"string"}],"responses":{"200":{"description":"Success","schema":{"items":{"$ref":"#/definitions/Article"},"type":"array"}}},"tags":["default"]}},"/spec":{"get":{"operationId":"get_spec","responses":{"200":{"description":"Success"}},"tags":["default"]}}},"produces":["application/json"],"responses":{"MaskError":{"description":"When any error occurs on mask"},"ParseError":{"description":"When a mask can't be parsed"}},"swagger":"2.0","tags":[{"description":"Default namespace","name":"default"}]}
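
The spec above constrains the query parameters of the legacy endpoint: s and t are required, n ranges from 0 to 24 with a default of 12, article must match the pipe-separated pattern, and search is one of "morelike" or "wiki". A client-side pre-check could be sketched as below. The helper name validate_query and the wording of the error messages are hypothetical; only the constraints come from the spec.

```python
import re

# Pattern and enum values taken from the /api/spec output above.
ARTICLE_PATTERN = re.compile(r"^([^|]+(\|[^|]+)*)?$")
SEARCH_CHOICES = {"morelike", "wiki"}


def validate_query(s, t, n=12, article="", search="morelike"):
    """Return a list of problems; an empty list means the query looks valid."""
    errors = []
    if not s:
        errors.append("s (source language) is required")
    if not t:
        errors.append("t (target language) is required")
    if not 0 <= n <= 24:
        errors.append("n must be between 0 and 24")
    if not ARTICLE_PATTERN.match(article):
        errors.append("article must be a seed or a '|'-separated list of seeds")
    if search not in SEARCH_CHOICES:
        errors.append("search must be 'morelike' or 'wiki'")
    return errors


print(validate_query("en", "fr", n=3, article="Basketball"))  # []
```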
Sep 21 2023, 3:45 PM · Machine-Learning-Team
kevinbazira updated subscribers of T347015: Deploy the recommendation-api-ng on LiftWing.

On IRC, @elukey advised that we use deploy2002 due to a recent datacenter switchover. The second attempt to deploy on staging returned:

kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sync
Affected releases are:
  main (wmf-stable/python-webapp) UPDATED
Sep 21 2023, 10:15 AM · Machine-Learning-Team
kevinbazira added a comment to T347015: Deploy the recommendation-api-ng on LiftWing.

The attempt to deploy the rec-api on LW staging returned:

kevinbazira@deploy1002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sync
Affected releases are:
  main (wmf-stable/python-webapp) UPDATED
Sep 21 2023, 9:31 AM · Machine-Learning-Team
kevinbazira created T347015: Deploy the recommendation-api-ng on LiftWing.
Sep 21 2023, 9:29 AM · Machine-Learning-Team

Sep 20 2023

kevinbazira added a comment to T339890: Host the recommendation-api container on LiftWing.

@Isaac, thank you for the pointer. I have tested the recommendation-api with related_articles switched off and it ran without the errors.

Sep 20 2023, 7:44 AM · Patch-For-Review, Machine-Learning-Team

Sep 19 2023

kevinbazira moved T346218: Adapt the recommendation-api to use float32 preprocessed numpy arrays from swift from Unsorted to In Progress on the Machine-Learning-Team board.
Sep 19 2023, 7:16 AM · Machine-Learning-Team
kevinbazira moved T346411: Upload recommendation-api preprocessed numpy binaries to Swift from Unsorted to In Progress on the Machine-Learning-Team board.
Sep 19 2023, 7:16 AM · Machine-Learning-Team
kevinbazira added a comment to T339890: Host the recommendation-api container on LiftWing.

Thank you for sharing this information, @Isaac.

Sep 19 2023, 5:51 AM · Patch-For-Review, Machine-Learning-Team

Sep 15 2023

kevinbazira updated the task description for T346218: Adapt the recommendation-api to use float32 preprocessed numpy arrays from swift.
Sep 15 2023, 7:09 AM · Machine-Learning-Team
kevinbazira added a comment to T346411: Upload recommendation-api preprocessed numpy binaries to Swift.

The recommendation-api preprocessed numpy binaries were uploaded successfully to Thanos Swift. Below are their storage URIs:

| file | uri |
| --- | --- |
| wikidata_ids.npy | s3://wmf-ml-models/recommendation_api/enwiki/20230915010659/wikidata_ids.npy |
| decoded_lines_float32.npy | s3://wmf-ml-models/recommendation_api/enwiki/20230915011054/decoded_lines_float32.npy |
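
A storage URI like the ones above splits cleanly into a bucket and an object key, which is typically what a Swift/S3 client asks for. A minimal sketch using only the standard library; the helper name split_storage_uri is hypothetical and no storage client is involved.

```python
from urllib.parse import urlparse


def split_storage_uri(uri: str):
    """Split an s3:// storage URI into (bucket, object key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// uri, got {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_storage_uri(
    "s3://wmf-ml-models/recommendation_api/enwiki/20230915010659/wikidata_ids.npy"
)
print(bucket)
print(key)
```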
Sep 15 2023, 1:28 AM · Machine-Learning-Team
kevinbazira created T346411: Upload recommendation-api preprocessed numpy binaries to Swift.
Sep 15 2023, 1:25 AM · Machine-Learning-Team

Sep 13 2023

kevinbazira created T346218: Adapt the recommendation-api to use float32 preprocessed numpy arrays from swift.
Sep 13 2023, 10:01 AM · Machine-Learning-Team
kevinbazira added a comment to T339890: Host the recommendation-api container on LiftWing.

A 4th memory usage test was run using a combination of the 2nd test (float downcast to np.float32) and the 3rd test (preprocessed numpy arrays). Below are the steps taken and the results:

Sep 13 2023, 9:24 AM · Patch-For-Review, Machine-Learning-Team

Sep 12 2023

kevinbazira added a comment to T341704: Content Recommendation API migration .

We are working on making adjustments to the rec-api deployment settings until we get to a state that can run on LiftWing.

Sep 12 2023, 2:31 PM · Goal, Machine-Learning-Team
kevinbazira added a comment to T339890: Host the recommendation-api container on LiftWing.

As suggested in T339890#9156420, I ran the load_raw_embedding method and saved both the wikidata_ids and decoded_lines numpy arrays:

...
np.save('wikidata_ids.npy', self.wikidata_ids)
np.save('decoded_lines.npy', self.embedding)
.
.
.
root@a0efd763f796:/home/recommendation-api# ls -lh
-rw-r--r-- 1 root root 2.4G Sep 12 06:53 decoded_lines.npy
-rw-r--r-- 1 root root 107M Sep 12 06:53 wikidata_ids.npy
Sep 12 2023, 8:03 AM · Patch-For-Review, Machine-Learning-Team

Sep 11 2023

kevinbazira updated subscribers of T339890: Host the recommendation-api container on LiftWing.

@calbon suggested that we use dtype=np.float32 to reduce memory usage. I have tested it and below are the results:

| dtype | on-load | steady-state |
| --- | --- | --- |
| float | ~7GB | ~5GB |
| np.float32 | ~3.8GB | ~2.8GB |
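
The saving comes from element size: Python's float maps to a 64-bit numpy dtype, while np.float32 uses half as many bytes per value. The sketch below shows the effect on a small random array that stands in for the embedding. The array shape and variable names are made up for illustration, and the measured figures in the table also include overhead that this sketch does not model.

```python
import numpy as np

# Stand-in for the embedding: 1000 vectors of 50 dimensions, float64 by default.
rng = np.random.default_rng(0)
embedding64 = rng.random((1000, 50))

# Downcast to 32-bit floats, as suggested above.
embedding32 = embedding64.astype(np.float32)

print(embedding64.dtype, embedding64.nbytes)
print(embedding32.dtype, embedding32.nbytes)
```

The downcast trades a small amount of precision for the lower footprint, so it is worth checking that recommendation quality is unaffected.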
Sep 11 2023, 2:25 PM · Patch-For-Review, Machine-Learning-Team
kevinbazira added a comment to T339890: Host the recommendation-api container on LiftWing.

@elukey, on IRC you mentioned:

Sep 11 2023, 11:34 AM · Patch-For-Review, Machine-Learning-Team
kevinbazira added a comment to T339890: Host the recommendation-api container on LiftWing.

@elukey, regarding the rec-api memory usage, please see the findings below, obtained from docker stats after monitoring the rec-api container running locally:

Sep 11 2023, 11:33 AM · Patch-For-Review, Machine-Learning-Team

Sep 8 2023

kevinbazira added a comment to T339890: Host the recommendation-api container on LiftWing.

Deployment settings for the recommendation-api-ng have been merged but when we try to deploy on staging we get:

Sep 8 2023, 1:51 PM · Patch-For-Review, Machine-Learning-Team

Sep 5 2023

kevinbazira added a comment to T341704: Content Recommendation API migration .

In T339890, we are working on hosting the recommendation-api container on LiftWing. Tasks completed so far are:

  1. stored the embedding on swift
  2. added a swift client to the recommendation-api
  3. created CI jobs named recommendation-api-ng
  4. published the rec-api image to the Wikimedia docker registry
Sep 5 2023, 12:54 PM · Goal, Machine-Learning-Team
kevinbazira added a comment to T341704: Content Recommendation API migration.

In T338805, we containerized the Flask web application that runs the Content Translation Recommendation API. Here is a summary of the steps that were taken:

  1. created a docker image after wrangling dependencies and configurations that were set in 2016
  2. fetched the embedding from figshare and hosted it locally in the container
  3. tested the backend and were able to hit the recommendation-api endpoint
  4. set up the frontend and were able to interact with GapFinder
Sep 5 2023, 12:54 PM · Goal, Machine-Learning-Team

Aug 31 2023

kevinbazira closed T344319: Remove models with poor evaluation metrics from the published datasets repo, a subtask of T309263: Support languages whose add-a-link models were not published, as Resolved.
Aug 31 2023, 3:19 PM · CommRel-Specialists-Support (Oct-Dec-2023), Chinese-Sites, Machine-Learning-Team, Growth-Team, Add-Link
kevinbazira closed T344319: Remove models with poor evaluation metrics from the published datasets repo as Resolved.

To prevent mishaps like T344319#9109329 in the future, we have automated the unpublishing process of add-a-link datasets using this script: https://github.com/wikimedia/research-mwaddlink/blob/main/unpublish-datasets.sh
Moving forward, to unpublish a given wiki's datasets, one can run the following command:

WIKI_ID=<WIKI_ID> ./unpublish-datasets.sh
Aug 31 2023, 3:19 PM · Growth-Team, Machine-Learning-Team, Add-Link
kevinbazira committed rRMWA0a6d3b9fd21a: Add unpublish script (authored by kevinbazira).
Add unpublish script
Aug 31 2023, 2:18 PM

Aug 30 2023

kevinbazira closed T344832: Investigate why the add-a-link training pipeline concludes with missing datasets, a subtask of T344799: Automate unpublishing of add-a-link datasets, as Resolved.
Aug 30 2023, 1:18 PM · Growth-Team (Sprint 0 (Growth Team)), Machine-Learning-Team, Add-Link
kevinbazira closed T344832: Investigate why the add-a-link training pipeline concludes with missing datasets as Resolved.

The model training pipeline has been fixed and it now generates all the expected datasets that should be published. Below is a summary of what was fixed:

  1. use conda env (python3.10) to run spark jobs
  2. spark commands distribute conda env as an archive to the jobs
  3. use python3.7 env to:
    • train wikipedia2vec and filter its output
    • enable sqlitedict to use pickle protocol 4 instead of 5
  4. update README docs on setting up the 2 python envs mentioned above
  5. adapt requirements.txt dependency versions to support python3.7
  6. explicitly set protocol 4 in both generate_anchor_dictionary and generate_wdproperties_spark scripts that were still using pickle protocol 5 via HIGHEST_PROTOCOL.
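
For context on item 6: pickle.HIGHEST_PROTOCOL is 5 on Python 3.8 and later, while Python 3.7 can read protocol 4 at most, so data written from the python3.10 env must pin the protocol to stay readable in the python3.7 env. Below is a minimal sketch of pinning and checking the protocol with the standard library only. The helper names and the sample payload are hypothetical and are not taken from the mwaddlink scripts.

```python
import pickle


def dumps_compatible(obj, protocol: int = 4) -> bytes:
    """Serialize obj with a fixed pickle protocol instead of HIGHEST_PROTOCOL."""
    return pickle.dumps(obj, protocol=protocol)


def protocol_of(data: bytes) -> int:
    """Read the protocol number that follows the PROTO opcode (protocols 2 and up)."""
    if data[:1] != b"\x80":
        raise ValueError("no PROTO opcode; this is a protocol 0 or 1 pickle")
    return data[1]


# Hypothetical sample payload, for illustration only.
payload = dumps_compatible({"anchor": ["Q1", "Q2"]})
print(protocol_of(payload))
print(pickle.loads(payload))
```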
Aug 30 2023, 1:18 PM · Growth-Team, Machine-Learning-Team, Add-Link
kevinbazira committed rRMWA8f60ae7e4e95: Fix model training pipeline (authored by kevinbazira).
Fix model training pipeline
Aug 30 2023, 12:13 PM

Aug 29 2023

kevinbazira added a comment to T344832: Investigate why the add-a-link training pipeline concludes with missing datasets.

Following T345091#9124589, we learned that the sqlitedict package was using pickle protocol 5 via HIGHEST_PROTOCOL within the python3.10 environment. To avoid running into the unsupported pickle protocol: 5 issue, we are going to run all scripts that rely on this package within the python3.7 environment.

Aug 29 2023, 12:47 PM · Growth-Team, Machine-Learning-Team, Add-Link
kevinbazira added a comment to T345091: research/mwaddlink has an all-repo CI failure.

Great. Thank you @Urbanecm_WMF and @MGerlach for resolving this while I was away. Sorry for the inconvenience.

Aug 29 2023, 6:15 AM · Growth-Team (Sprint 0 (Growth Team)), ci-test-error, Add-Link

Aug 28 2023

kevinbazira moved T344832: Investigate why the add-a-link training pipeline concludes with missing datasets from Unsorted to In Progress on the Machine-Learning-Team board.
Aug 28 2023, 8:19 AM · Growth-Team, Machine-Learning-Team, Add-Link
kevinbazira claimed T344832: Investigate why the add-a-link training pipeline concludes with missing datasets.

We adapted the training pipeline:

  1. spark jobs run within the conda env (python3.10)
  2. spark commands distribute conda env as an archive to the jobs
  3. python3.7 env is used to train wikipedia2vec and filter its output
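
One documented PySpark pattern for step 2 is to pack the conda env into a tarball, ship it with --archives, and point PYSPARK_PYTHON at the interpreter inside the unpacked archive. Below is a sketch that only builds the command, assuming that pattern. The archive name and the "environment" alias are hypothetical, and the pipeline's actual commands may differ.

```python
from typing import Dict, List, Tuple


def packed_env_submit(script: str, env_archive: str,
                      alias: str = "environment") -> Tuple[Dict[str, str], List[str]]:
    """Build the environment variables and argv for spark-submit with a packed env."""
    # Executors resolve this path relative to the directory where YARN unpacks the archive.
    env = {"PYSPARK_PYTHON": f"./{alias}/bin/python"}
    argv = [
        "spark-submit",
        "--master", "yarn",
        # "<file>#<alias>" asks YARN to unpack the archive under ./<alias>.
        "--archives", f"{env_archive}#{alias}",
        script,
    ]
    return env, argv


# The archive name below is a hypothetical placeholder.
env, argv = packed_env_submit("generate_anchor_dictionary_spark.py",
                              "link-recommendation-env.tar.gz")
print(env)
print(" ".join(argv))
```

Shipping the env this way avoids depending on a python binary under a home directory that may not exist on the worker nodes.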
Aug 28 2023, 8:18 AM · Growth-Team, Machine-Learning-Team, Add-Link

Aug 25 2023

kevinbazira added a comment to T344832: Investigate why the add-a-link training pipeline concludes with missing datasets.

The next step was to adapt the entire training pipeline (i.e. run-pipeline.sh) to use a packed env. On running it, the error below was thrown:

$ WIKI_ID="simplewiki" ./run-pipeline.sh
.
.
.
RUNNING wikipedia2vec on dump
Traceback (most recent call last):
  File "/home/kevinbazira/.conda/envs/link-recommendation-env/bin/wikipedia2vec", line 5, in <module>
    from wikipedia2vec.cli import cli
  File "/home/kevinbazira/.conda/envs/link-recommendation-env/lib/python3.10/site-packages/wikipedia2vec/__init__.py", line 4, in <module>
    from .dictionary import Dictionary
  File "wikipedia2vec/dictionary.pyx", line 1, in init wikipedia2vec.dictionary
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Aug 25 2023, 1:06 PM · Growth-Team, Machine-Learning-Team, Add-Link
kevinbazira updated subscribers of T344832: Investigate why the add-a-link training pipeline concludes with missing datasets.

@MGerlach and I looked into why the spark jobs in the training pipeline weren't running and discovered that they weren't able to run python even though it existed. The error below was being thrown:

(an-worker1092.eqiad.wmnet executor 1): java.io.IOException: Cannot run program "/home/kevinbazira/.conda/envs/adda-link-env/bin/python3": error=2, No such file or directory
Aug 25 2023, 1:01 PM · Growth-Team, Machine-Learning-Team, Add-Link
kevinbazira added a comment to T344832: Investigate why the add-a-link training pipeline concludes with missing datasets.

Thank you for the clarification @nshahquinn-wmf.

Aug 25 2023, 12:57 PM · Growth-Team, Machine-Learning-Team, Add-Link

Aug 24 2023

kevinbazira added a comment to T344832: Investigate why the add-a-link training pipeline concludes with missing datasets.

Stepped through the execution of the training pipeline and discovered that generate_anchor_dictionary_spark.py does not generate the expected files. This is likely caused by a spark session that no longer connects, even when kerberos credentials are enabled on stat1008.

Aug 24 2023, 12:56 AM · Growth-Team, Machine-Learning-Team, Add-Link

Aug 23 2023

kevinbazira created T344832: Investigate why the add-a-link training pipeline concludes with missing datasets.
Aug 23 2023, 3:50 PM · Growth-Team, Machine-Learning-Team, Add-Link
kevinbazira moved T344799: Automate unpublishing of add-a-link datasets from Unsorted to In Progress on the Machine-Learning-Team board.
Aug 23 2023, 8:49 AM · Growth-Team (Sprint 0 (Growth Team)), Machine-Learning-Team, Add-Link
kevinbazira moved T344319: Remove models with poor evaluation metrics from the published datasets repo from Unsorted to In Progress on the Machine-Learning-Team board.
Aug 23 2023, 8:49 AM · Growth-Team, Machine-Learning-Team, Add-Link
kevinbazira created T344799: Automate unpublishing of add-a-link datasets.
Aug 23 2023, 8:48 AM · Growth-Team (Sprint 0 (Growth Team)), Machine-Learning-Team, Add-Link

Aug 22 2023

kevinbazira added a comment to T344319: Remove models with poor evaluation metrics from the published datasets repo.

Thank you @Sgs and @Urbanecm_WMF. aswiki has been removed from the published datasets repo and wikis.txt.

Aug 22 2023, 3:05 PM · Growth-Team, Machine-Learning-Team, Add-Link

Aug 21 2023

kevinbazira added a comment to T344319: Remove models with poor evaluation metrics from the published datasets repo.

ganwiki and krcwiki datasets have been removed from the published datasets repo.

Aug 21 2023, 12:59 PM · Growth-Team, Machine-Learning-Team, Add-Link

Aug 18 2023

kevinbazira added a comment to T344319: Remove models with poor evaluation metrics from the published datasets repo.

Thank you for the follow-up, @Sgs. We are going to go ahead and remove ganwiki and krcwiki as we wait for aswiki.

Aug 18 2023, 1:19 PM · Growth-Team, Machine-Learning-Team, Add-Link

Aug 16 2023

kevinbazira updated the task description for T344319: Remove models with poor evaluation metrics from the published datasets repo.
Aug 16 2023, 8:47 AM · Growth-Team, Machine-Learning-Team, Add-Link
kevinbazira created T344319: Remove models with poor evaluation metrics from the published datasets repo.
Aug 16 2023, 8:24 AM · Growth-Team, Machine-Learning-Team, Add-Link
kevinbazira updated the task description for T309263: Support languages whose add-a-link models were not published.
Aug 16 2023, 6:27 AM · CommRel-Specialists-Support (Oct-Dec-2023), Chinese-Sites, Machine-Learning-Team, Growth-Team, Add-Link

Aug 11 2023

kevinbazira closed T342084: Post-merge build failed due to Internal Server Error as Resolved.
Aug 11 2023, 8:50 AM · ci-test-error, Release Pipeline, Machine-Learning-Team
kevinbazira closed T342084: Post-merge build failed due to Internal Server Error, a subtask of T288198: Pushes to docker-registry fail for images with compressed layers of size >1GB, as Resolved.
Aug 11 2023, 8:50 AM · Release Pipeline, MW-on-K8s, serviceops
kevinbazira closed T342084: Post-merge build failed due to Internal Server Error, a subtask of T339890: Host the recommendation-api container on LiftWing, as Resolved.
Aug 11 2023, 8:49 AM · Patch-For-Review, Machine-Learning-Team
kevinbazira added a comment to T339890: Host the recommendation-api container on LiftWing.

After encountering issues with CI fetching files from the Wikimedia public datasets archive (T341582) and a CI post-merge build failure due to an internal server error (T342084), the recommendation-api image has finally been published to the Docker registry: https://docker-registry.wikimedia.org/wikimedia/research-recommendation-api/tags/. The next step is to deploy it onto LiftWing/k8s.

Aug 11 2023, 8:07 AM · Patch-For-Review, Machine-Learning-Team

Aug 10 2023

kevinbazira added a comment to T343951: Post-merge build succeeded but image not published to docker-registry.

Thanks @elukey, indeed there was a delay for the docker-registry website to sync with the registry. After a couple of hours, the research-recommendation-api image is now visible at https://docker-registry.wikimedia.org/wikimedia/research-recommendation-api/tags/

Aug 10 2023, 12:42 PM · Machine-Learning-Team
kevinbazira created T343951: Post-merge build succeeded but image not published to docker-registry.
Aug 10 2023, 8:36 AM · Machine-Learning-Team
kevinbazira added a comment to T343576: Store and fetch the recommendation-api embedding from Swift.

WMF currently uses Thanos Swift for object storage, but there is a plan to migrate to MOSS, which may run Ceph. Ceph docs (1, 2) show that it supports the Swift API. This Swift client will be able to work for us on both Thanos Swift and MOSS Ceph.

Aug 10 2023, 5:42 AM · Machine-Learning-Team

Aug 9 2023

kevinbazira added a comment to T343576: Store and fetch the recommendation-api embedding from Swift.

A Swift client has been added to the recommendation-api image to enable it to fetch embeddings from Swift.
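
A rough sketch of what fetching a large object through a Swift client can look like, streaming it to disk in chunks rather than holding it all in memory. It assumes a connection object with a python-swiftclient-style get_object method. The container name, object name and the in-memory stand-in connection are hypothetical, and this is not the code that was added to the recommendation-api.

```python
import os
import tempfile
from typing import Any


def fetch_object(conn: Any, container: str, name: str, dest_path: str,
                 chunk_size: int = 1 << 20) -> int:
    """Stream one object to dest_path and return the number of bytes written.

    conn is anything exposing get_object(container, name, resp_chunk_size=...)
    that returns (headers, iterator of byte chunks), as python-swiftclient's
    Connection does when resp_chunk_size is set.
    """
    _headers, chunks = conn.get_object(container, name, resp_chunk_size=chunk_size)
    written = 0
    with open(dest_path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
            written += len(chunk)
    return written


class InMemoryConnection:
    """Hypothetical stand-in so the sketch runs without a Swift cluster."""

    def __init__(self, objects):
        self._objects = objects

    def get_object(self, container, name, resp_chunk_size=None):
        data = self._objects[(container, name)]
        step = resp_chunk_size or len(data) or 1
        return {}, (data[i:i + step] for i in range(0, len(data), step))


# Hypothetical container and object names.
demo_conn = InMemoryConnection({("embeddings", "demo.bin"): b"x" * 2500})
dest = os.path.join(tempfile.mkdtemp(), "demo.bin")
written = fetch_object(demo_conn, "embeddings", "demo.bin", dest, chunk_size=1024)
print(written, os.path.getsize(dest))
```

Passing the connection in as an argument keeps the download logic independent of how authentication is configured.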

Aug 9 2023, 4:15 PM · Machine-Learning-Team
kevinbazira added a comment to T342084: Post-merge build failed due to Internal Server Error.

To reduce image layer sizes, in T343576 we store and fetch the ~2.8GB recommendation-api embedding from Swift as recommended in T288198#9037109. This has enabled the post-merge build to succeed: https://integration.wikimedia.org/ci/job/trigger-recommendation-api-ng-pipeline-publish/4/console

Aug 9 2023, 3:48 PM · ci-test-error, Release Pipeline, Machine-Learning-Team