User Details
- User Since
- Aug 3 2019, 6:58 AM (225 w, 4 d)
- Availability
- Available
- IRC Nick
- kevinbazira
- LDAP User
- Kevin Bazira
- MediaWiki User
- KBazira (WMF) [ Global Accounts ]
Today
Yesterday
Working on migrating the machine-generated article descriptions model from toolforge to LiftWing:
- added article-descriptions model-server to LiftWing inference services repo
- added CI pipeline jobs to test and publish the article-descriptions model-server image to the Wikimedia docker registry
- uploaded article-descriptions model files to swift in mbart-large-cc25 and bert-base-multilingual-uncased paths.
- added the article-descriptions inference service to the experimental namespace on LiftWing
- fixed the model-server to pass the local_files_only parameter so that the pretrained pytorch tokenizer is instantiated from local files instead of being downloaded from huggingface.co.
- in T351940#9359437 fixed the AsyncSession host header issue experienced in T351940#9358303.
- currently working on fixing the wikipedia api summary endpoint as we have to use a k8s internal endpoint to access it.
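A minimal sketch of the local_files_only fix mentioned in the list above. It assumes the Hugging Face transformers library and its AutoTokenizer class; the actual tokenizer class used by the model-server is not stated here, and load_local_tokenizer is a hypothetical helper name.

```python
from pathlib import Path


def load_local_tokenizer(model_dir: str):
    """Load a tokenizer strictly from files already on disk.

    Hypothetical helper: AutoTokenizer is an assumption, the model-server
    may use a more specific tokenizer class.
    """
    path = Path(model_dir)
    if not path.is_dir():
        # Fail early with a clear message if the model files were not mounted.
        raise FileNotFoundError(f"model directory not found: {path}")

    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoTokenizer

    # local_files_only=True tells transformers not to contact huggingface.co.
    return AutoTokenizer.from_pretrained(str(path), local_files_only=True)
```

The early directory check is a design choice for clearer errors inside a pod that has no outbound access; it is not part of the original fix.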
Mon, Nov 27
I discovered what was causing this issue and pushed a patch for it here.
Fri, Nov 24
Thu, Nov 23
Mon, Nov 20
Wed, Nov 15
@Sgs, ttermwiki has been removed from the published datasets. It was created to test the unpublish-datasets script in T344799.
Tue, Nov 14
1. Isaac from the research team tested the deployed rec-api and shared 2 edge cases:
- the rec-api wasn't returning results for anything besides 'spec'. We investigated this in T347475, noticed envoy proxy constraints, and fixed them in T348607.
- the rec-api was returning empty results when a query was made without the 'seed' param. We discovered that the pageviews envoy settings weren't correct and fixed them. We later updated them based on the wp-analytics team's notice in T348607#9283681.
2. Started working on migrating the machine-generated article descriptions model from toolforge to LiftWing.
Tue, Nov 7
@Isaac thank you for sharing the new codebase that has fewer dependencies. We are adapting the article-descriptions kserve model-server to use this codebase.
Fri, Nov 3
Thu, Nov 2
Hi @Isaac and @Seddon, while working on the migration of the article-descriptions model from Toolforge to LiftWing, we noticed that the GitHub repository, which this project relies on (including the Toolforge instance and some of the LiftWing dependencies), is not owned by WMF but is owned by an individual. In case the owner decides to delete it, the LiftWing model-server won't have the necessary dependencies. Do you plan to move this repository to WMF's Gerrit or GitLab?
Oct 26 2023
Folks from wikimedia-analytics notified us that they are migrating the pageviews endpoint as shown in the screenshot below:
We have updated the rec-api-ng envoy settings for the pageviews endpoint from:
uri: http://localhost:6020/analytics.wikimedia.org/v1/pageview
listener: aqs
to
uri: http://localhost:6033/wikimedia.org/v1/metrics/pageviews
listener: rest-gateway
Oct 24 2023
Thank you for sharing the notes, @Isaac. We were able to reproduce this issue in T348607#9275075 and fixed it in T348607#9275348. The rec-api-ng now returns results whether or not the seed parameter is specified as shown below:
Since we couldn't experiment and check the correct pageviews uri from k8s, an SRE logged into one of the AQS nodes and tested the uri with curl until we got the correct one: http://localhost:6020/analytics.wikimedia.org/v1/pageviews. We've added it to the rec-api-ng, deployed it on staging, tested that it works, and then deployed it to prod.
As reported in T347475#9269007, the pageviews endpoint is failing. I reproduced this issue on staging and here are the logs showing that the endpoint fails because the URI returns a 404 (Not Found):
2023-10-24 05:09:12,783 recommendation.utils.event_logger log_api_request():37 INFO -- Logging event: {"schema": "TranslationRecommendationAPIRequests", "$schema": "/analytics/legacy/$translationrecommendationapirequests/1.0.0", "revision": 16261139, "event": {"timestamp": 1698124152, "sourceLanguage": "en", "targetLanguage": "es", "searchAlgorithm": "related_articles"}, "webHost": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443", "client_dt": "2023-10-24T05:09:12.783157", "meta": {"stream": "eventlogging_TranslationRecommendationAPIRequests", "domain": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443"}}
2023-10-24 05:09:12,801 recommendation.api.external_data.fetcher get():26 INFO -- Request failed: {"url": "http://localhost:6020/analytics.wikimedia.org/v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/22", "error": "404 Client Error: Not Found for url: http://localhost:6020/analytics.wikimedia.org/v1/metrics/pageviews/top/en.wikipedia/all-access/2023/10/22"}
2023-10-24 05:09:12,801 recommendation.api.external_data.fetcher get_most_popular_articles():143 INFO -- pageview query failed
2023-10-24 05:09:12,847 recommendation.api.types.translation.translation process_request():195 INFO -- Request processed in 0.064753 seconds
[pid: 135|app: 0|req: 7360/10465] 127.0.0.1 () {44 vars in 2182 bytes} [Tue Oct 24 05:09:12 2023] GET /types/translation/v1/articles?source=en&target=es&seed=&search=related_articles&application=CX => generated 3 bytes in 65 msecs (HTTP/1.1 200) 5 headers in 195 bytes (1 switches on core 0)
Oct 23 2023
To help others who might have trouble resolving the issue above in the future, I have added a note to the envoy proxy docs showing that the same host can use different envoy listeners depending on the endpoint being accessed:
https://wikitech.wikimedia.org/w/index.php?title=Envoy&oldid=2121603#Example_(calling_mw-api)
Oct 20 2023
@Isaac, the rec-api-ng endpoint now works as shown below. Please let us know whether there are edge cases we might have missed:
$ time curl "https://recommendation-api-ng.discovery.wmnet:31443/api/?s=en&t=fr&n=3&article=Basketball"
[{"pageviews": 0, "title": "2019_United_States_FIBA_Basketball_World_Cup_team", "wikidata_id": "Q56042822", "rank": 495.0}, {"pageviews": 0, "title": "Molly_Bolin", "wikidata_id": "Q27451749", "rank": 482.0}, {"pageviews": 0, "title": "Dom_Flora", "wikidata_id": "Q5289728", "rank": 478.0}]
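For checking edge cases such as a missing seed, the query above can be assembled programmatically. This is a sketch only: build_recommendation_url is a hypothetical helper, and the 0 to 24 bound on n is taken from the spec served at /api/spec.

```python
from typing import Optional
from urllib.parse import urlencode

# Production endpoint used in the curl example above.
BASE_URL = "https://recommendation-api-ng.discovery.wmnet:31443/api/"


def build_recommendation_url(source: str, target: str, count: int = 3,
                             seed: Optional[str] = None) -> str:
    """Build a legacy rec-api query URL, with or without a seed article."""
    if not 0 <= count <= 24:
        # The published spec declares minimum 0 and maximum 24 for "n".
        raise ValueError("count must be between 0 and 24")
    params = {"s": source, "t": target, "n": count}
    if seed:
        # "article" is the seed parameter; leave it out to test the no-seed path.
        params["article"] = seed
    return f"{BASE_URL}?{urlencode(params)}"
```

Calling it once with a seed and once without gives the two URLs needed to exercise both code paths.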
In T348607#9267956 we fixed the envoy listener for the pageviews endpoint. Now the rec-api-ng is able to access all external endpoints from k8s/LiftWing.
We discovered that the pageviews endpoint does not use mw-api-int-async-ro but rather aqs listener. The full settings that enabled the rec-api-ng hosted on LiftWing to access endpoints external to k8s/LiftWing through the envoy proxy are:
endpoint name | endpoint host header | endpoint uri | envoy listener |
language_pairs | cxserver.wikimedia.org | http://localhost:6015/v1/languagepairs | cxserver |
pageviews | wikimedia.org | http://localhost:6033/wikimedia.org/v1/metrics/pageviews | rest-gateway |
wikipedia | {source}.wikipedia.org | http://localhost:6500/w/api.php | mw-api-int-async-ro |
wikidata | www.wikidata.org | http://localhost:6500/w/api.php | mw-api-int-async-ro |
event_logger | intake-analytics.wikimedia.org | http://localhost:6004/v1/events?hasty=true | eventgate-analytics |
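The table above boils down to one pattern: send the request to the local envoy listener and carry the real service name in the Host header. A minimal sketch of that pattern follows; build_envoy_request and ENVOY_ENDPOINTS are hypothetical names, not taken from the rec-api codebase, and the request is only constructed, never sent.

```python
from urllib.request import Request

# (host header, local envoy URI) pairs copied from the settings table.
ENVOY_ENDPOINTS = {
    "language_pairs": ("cxserver.wikimedia.org",
                       "http://localhost:6015/v1/languagepairs"),
    "pageviews": ("wikimedia.org",
                  "http://localhost:6033/wikimedia.org/v1/metrics/pageviews"),
    "wikipedia": ("{source}.wikipedia.org",
                  "http://localhost:6500/w/api.php"),
    "wikidata": ("www.wikidata.org",
                 "http://localhost:6500/w/api.php"),
    "event_logger": ("intake-analytics.wikimedia.org",
                     "http://localhost:6004/v1/events?hasty=true"),
}


def build_envoy_request(endpoint: str, source: str = "en") -> Request:
    """Return a request aimed at the local envoy listener for `endpoint`."""
    host_template, uri = ENVOY_ENDPOINTS[endpoint]
    # Only the wikipedia entry contains a {source} placeholder.
    host = host_template.format(source=source)
    # The URI targets localhost; the Host header names the real service.
    return Request(uri, headers={"Host": host})
```

For example, build_envoy_request("wikipedia", source="fr") targets http://localhost:6500/w/api.php with Host set to fr.wikipedia.org.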
Oct 19 2023
We resolved the envoy proxy issues as shown in T348607#9264192. However, the rec-api instance hosted on LiftWing is still not performing as expected. The container logs on staging show that the worker processes are being terminated when the rec-api is processing a large number of results (501) from the "morelike" external endpoint.
Wed Oct 18 10:32:52 2023 - *** HARAKIRI ON WORKER 1 (pid: 134, try: 1) ***
Wed Oct 18 10:32:52 2023 - HARAKIRI !!! worker 1 status !!!
Wed Oct 18 10:32:52 2023 - HARAKIRI [core 0] 127.0.0.1 - GET /api/?s=en&t=fr&n=3&article=Apple since 1697625156
Wed Oct 18 10:32:52 2023 - HARAKIRI !!! end of worker 1 status !!!
2023-10-18 10:32:52,799 recommendation.utils.event_logger log_api_request():39 INFO -- Logging event: {"schema": "TranslationRecommendationAPIRequests", "$schema": "/analytics/legacy/$translationrecommendationapirequests/1.0.0", "revision": 16261139, "event": {"timestamp": 1697625172, "sourceLanguage": "en", "targetLanguage": "fr", "seed": "Apple", "searchAlgorithm": "morelike"}, "webHost": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443", "client_dt": "2023-10-18T10:32:52.799807", "meta": {"stream": "eventlogging_TranslationRecommendationAPIRequests", "domain": "recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443"}}
2023-10-18 10:32:53,089 recommendation.api.types.translation.candidate_finders get_morelike_candidates():39 INFO -- morelike returned 501 results
DAMN ! worker 1 (pid: 134) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 183)
[pid: 183|app: 0|req: 8/35] 10.192.0.201 () {30 vars in 376 bytes} [Wed Oct 18 10:33:02 2023] GET /api/spec => generated 1908 bytes in 4 msecs (HTTP/1.1 200) 5 headers in 198 bytes (1 switches on core 0)
Wed Oct 18 10:33:08 2023 - *** HARAKIRI ON WORKER 2 (pid: 166, try: 1) ***
Wed Oct 18 10:33:08 2023 - HARAKIRI !!! worker 2 status !!!
Wed Oct 18 10:33:08 2023 - HARAKIRI [core 0] 127.0.0.1 - GET /api/?s=en&t=fr&n=3&article=Apple since 1697625172
Wed Oct 18 10:33:08 2023 - HARAKIRI !!! end of worker 2 status !!!
DAMN ! worker 2 (pid: 166) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 187)
We are going to experiment with different resource limits and monitor the rec-api-ng performance on staging to identify the optimal settings that will enable the API to return results and perform as expected.
We configured envoy settings and the rec-api-ng container hosted on LiftWing is able to access endpoints external to k8s/LiftWing. Below are the settings we configured:
endpoint name | endpoint host header | envoy listener | port |
language_pairs | cxserver.wikimedia.org | cxserver | 6015 |
pageviews | wikimedia.org | mw-api-int-async-ro | 6500 |
wikipedia | {source}.wikipedia.org | mw-api-int-async-ro | 6500 |
wikidata | www.wikidata.org | mw-api-int-async-ro | 6500 |
event_logger | intake-analytics.wikimedia.org | eventgate-analytics | 6004 |
Oct 11 2023
Following T347475#9235065, it was clear that the rec-api-ng instance on LiftWing doesn't work like the rec-api container outside LiftWing, despite having similar resources allocated. We did some troubleshooting with @elukey on IRC and discovered that the container on LiftWing cannot access wikimedia, wikidata, and wikipedia endpoints. In T348607 we are configuring envoy to enable this instance to access endpoints external to k8s/LiftWing.
Oct 9 2023
Thanks @isarantopoulos and @Isaac for the suggestions!
Oct 5 2023
@Trizek-WMF, thank you so much for sharing this script that helps to curb overlinking. I am looping in @MGerlach, since he will work on improving add-a-link model performance, this might interest him.
Oct 4 2023
Thanks @elukey. Currently, these are the resources allocated to the recommendation-api pod. I am not aware of the amount of resources available on LiftWing. Is it possible to increase the resources allocated to the pod so that we can eliminate the possibility of resource constraints affecting the performance of the endpoint?
Oct 3 2023
Investigation Report
Without being able to fully test the recommendation-api on wmflabs and LiftWing, I ran a couple of experiments to investigate the cause of the performance disparity between the recommendation-api hosted on wmflabs and LiftWing, which is causing the internal endpoint to hang (as shown in this task's description). The steps I took and the results achieved are detailed below.
Oct 2 2023
Sep 27 2023
Sep 26 2023
We finally got rec-api deployment settings that could run on LiftWing:
- a 4th test was run and this dropped the memory usage by 5x
- we disabled the related_articles service (as advised by the research team), the rec-api ended up running without the embedding.
- deployed the rec-api on staging and then finally in production
Also added the user-agent header to all examples (curl, python, and JS) in the LiftWing usage docs:
JavaScript examples have been added to the LiftWing API gateway docs as shown below:
- revscoring goodfaith prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_goodfaith_prediction#Examples
- revscoring damaging prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_damaging_prediction#Examples
- revscoring reverted prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_reverted_prediction#Examples
- revscoring drafttopic prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_drafttopic_prediction#Examples
- revscoring draftquality prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_draftquality_prediction#Examples
- revscoring articlequality prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_articlequality_prediction#Examples
- revscoring articletopic prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_articletopic_prediction#Examples
- revert risk multilingual prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_reverted_risk_multilingual_prediction#Examples
- revert risk language agnostic prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_reverted_risk_language_agnostic_prediction#Examples
- articletopic outlink prediction: https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_articletopic_outlink_prediction#Examples#ooui-5
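To illustrate the user-agent change in the usage docs, here is a hedged sketch of a prediction request that always carries a User-Agent header. The model URL and the rev_id payload shape are assumptions based on the public Lift Wing docs rather than anything stated in this feed, and build_prediction_request is a hypothetical helper. The request is only built, not sent.

```python
import json
from urllib.request import Request

# Assumed API gateway URL for the enwiki goodfaith model.
GOODFAITH_URL = (
    "https://api.wikimedia.org/service/lw/inference/v1/"
    "models/enwiki-goodfaith:predict"
)


def build_prediction_request(rev_id: int, user_agent: str) -> Request:
    """Build a POST request that identifies the caller via User-Agent."""
    if not user_agent.strip():
        raise ValueError("a descriptive User-Agent is required")
    # Assumed payload shape: a JSON object with the revision id.
    body = json.dumps({"rev_id": rev_id}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "User-Agent": user_agent,
    }
    return Request(GOODFAITH_URL, data=body, headers=headers, method="POST")
```

Passing the result to urllib.request.urlopen would send it; that step is left out so the sketch has no network side effects.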
Sep 25 2023
Hi @Isaac, @santhosh, and @Seddon. The ML team was assigned the task of migrating the recommendation-api from wmflabs to LiftWing. We have successfully deployed the recommendation-api on LiftWing.
The recommendation-api-ng has successfully been deployed to LiftWing production in both eqiad and codfw:
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kube_env recommendation-api-ng ml-serve-eqiad
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
recommendation-api-ng-main-5c4f58c685-jts5q   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-m4mfs   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-q8vdn   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-snn99   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-tfdmt   2/2     Running   0          8m24s
...
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kube_env recommendation-api-ng ml-serve-codfw
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
recommendation-api-ng-main-5c4f58c685-dc22j   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-mqm97   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-s47l8   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-w8s2p   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-zw9tx   2/2     Running   0          7m37s
Sep 21 2023
The recommendation-api-ng has been successfully deployed on LiftWing staging:
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services$ curl https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/spec
{"basePath":"/api","consumes":["application/json"],"definitions":{"Article":{"properties":{"pageviews":{"description":"pageviews","type":"integer"},"rank":{"description":"rank","type":"number"},"title":{"description":"title","type":"string"},"wikidata_id":{"description":"wikidata_id","type":"string"}},"required":["rank","title","wikidata_id"],"type":"object"}},"info":{"title":"API","version":"1.0"},"paths":{"/":{"get":{"deprecated":true,"description":"Gets recommendations of source articles that are missing in the target","operationId":"get_legacy_article","parameters":[{"description":"Source wiki project language code","in":"query","name":"s","required":true,"type":"string"},{"description":"Target wiki project language code","in":"query","name":"t","required":true,"type":"string"},{"default":12,"description":"Number of recommendations to fetch","in":"query","maximum":24,"minimum":0,"name":"n","type":"integer"},{"description":"Seed article for personalized recommendations that can also be a list separated by \"|\"","in":"query","name":"article","pattern":"^([^|]+(\\|[^|]+)*)?$","type":"string"},{"default":true,"description":"Whether to include pageview counts","in":"query","name":"pageviews","type":"boolean"},{"collectionFormat":"multi","default":"morelike","description":"Which search algorithm to use if a seed is specified","enum":["morelike","wiki"],"in":"query","name":"search","type":"string"}],"responses":{"200":{"description":"Success","schema":{"items":{"$ref":"#/definitions/Article"},"type":"array"}}},"tags":["default"]}},"/spec":{"get":{"operationId":"get_spec","responses":{"200":{"description":"Success"}},"tags":["default"]}}},"produces":["application/json"],"responses":{"MaskError":{"description":"When any error occurs on mask"},"ParseError":{"description":"When a mask can't be parsed"}},"swagger":"2.0","tags":[{"description":"Default namespace","name":"default"}]}
On IRC, @elukey advised that we use deploy2002 due to a recent datacenter switchover. The second attempt to deploy on staging returned:
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sync
Affected releases are:
  main (wmf-stable/python-webapp) UPDATED
The attempt to deploy the rec-api on LW staging returned:
kevinbazira@deploy1002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sync
Affected releases are:
  main (wmf-stable/python-webapp) UPDATED
Sep 20 2023
@Isaac, thank you for the pointer. I have tested the recommendation-api with related_articles switched off and it ran without errors.
Sep 19 2023
Thank you for sharing this information, @Isaac.
Sep 15 2023
The recommendation-api preprocessed numpy binaries were uploaded successfully to Thanos Swift. Below are their storage URIs:
file | uri |
wikidata_ids.npy | s3://wmf-ml-models/recommendation_api/enwiki/20230915010659/wikidata_ids.npy |
decoded_lines_float32.npy | s3://wmf-ml-models/recommendation_api/enwiki/20230915011054/decoded_lines_float32.npy |
Sep 13 2023
A 4th memory usage test was run using a combination of the 2nd (float downcasted to np.float32) and 3rd (preprocessed numpy arrays). Below are the steps taken and the results:
Sep 12 2023
We are working on making adjustments to the rec-api deployment settings until we get to a state that can run on LiftWing.
As suggested in T339890#9156420, I ran the load_raw_embedding method and saved both the wikidata_ids and decoded_lines numpy arrays:
...
np.save('wikidata_ids.npy', self.wikidata_ids)
np.save('decoded_lines.npy', self.embedding)
. . .
root@a0efd763f796:/home/recommendation-api# ls -lh
-rw-r--r-- 1 root root 2.4G Sep 12 06:53 decoded_lines.npy
-rw-r--r-- 1 root root 107M Sep 12 06:53 wikidata_ids.npy
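The point of saving the arrays is that the service can later load them directly instead of re-parsing the raw embedding. A small sketch of that round trip, using the same file names as above; save_preprocessed and load_preprocessed are hypothetical helpers, not functions from the recommendation-api.

```python
import os

import numpy as np


def save_preprocessed(directory, wikidata_ids, embedding):
    """Write both arrays as .npy files and return their paths."""
    ids_path = os.path.join(directory, "wikidata_ids.npy")
    emb_path = os.path.join(directory, "decoded_lines.npy")
    np.save(ids_path, wikidata_ids)
    np.save(emb_path, embedding)
    return ids_path, emb_path


def load_preprocessed(ids_path, emb_path):
    """Load the arrays back; dtype and shape are restored from the files."""
    wikidata_ids = np.load(ids_path)
    embedding = np.load(emb_path)
    return wikidata_ids, embedding
```

Because the .npy format stores dtype and shape, an array saved as float32 comes back as float32 with no extra parsing step.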
Sep 11 2023
@calbon suggested that we use dtype=np.float32 to reduce memory usage. I have tested it and the results are below:
dtype | on-load | steady-state |
float | ~7GB | ~5GB |
np.float32 | ~3.8GB | ~2.8GB |
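The drop in the table is roughly what one would expect from halving the bytes per value. A toy-sized sketch of the downcast, with made-up array dimensions and a hypothetical downcast_embedding helper:

```python
import numpy as np


def downcast_embedding(embedding: np.ndarray) -> np.ndarray:
    """Return a float32 copy of the embedding (4 bytes per value)."""
    return embedding.astype(np.float32)


# Toy stand-in for the real embedding; NumPy floats default to float64.
vectors = np.random.default_rng(0).random((1000, 50))
small = downcast_embedding(vectors)

# nbytes reports the size of the array data only, without Python overhead.
print(vectors.dtype, vectors.nbytes)
print(small.dtype, small.nbytes)
```

The trade-off is precision: float32 keeps about 7 significant decimal digits, which is an assumption worth checking against the ranking quality before relying on it.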
@elukey, on IRC you mentioned:
@elukey, regarding the rec-api memory usage, please see the findings below, obtained from docker stats while monitoring the rec-api container running locally:
Sep 8 2023
Deployment settings for the recommendation-api-ng have been merged but when we try to deploy on staging we get:
Sep 5 2023
In T339890, we are working on hosting the recommendation-api container on LiftWing. Tasks completed so far are:
- stored the embedding on swift
- added a swift client to the recommendation-api
- created CI jobs named recommendation-api-ng
- published the rec-api image to the Wikimedia docker registry
In T338805, we containerized the Flask web application that runs the Content Translation Recommendation API. Here is a summary of the steps that were taken:
- created a docker image after wrangling dependencies and configurations that were set in 2016
- fetched the embedding from figshare and hosted it locally in the container
- tested the backend and were able to hit the recommendation-api endpoint
- set up the frontend and were able to interact with GapFinder
Aug 31 2023
To prevent mishaps like T344319#9109329 in the future, we have automated the unpublishing process of add-a-link datasets using this script: https://github.com/wikimedia/research-mwaddlink/blob/main/unpublish-datasets.sh
Moving forward, to unpublish a given wiki's datasets, one can run the following command:
WIKI_ID=<WIKI_ID> ./unpublish-datasets.sh
Aug 30 2023
The model training pipeline has been fixed and it now generates all the expected datasets that should be published. Below is a summary of what was fixed:
- use conda env (python3.10) to run spark jobs
- spark commands distribute conda env as an archive to the jobs
- use python3.7 env to:
- train wikipedia2vec and filter its output
- enable sqlitedict to use pickle protocol 4 instead of 5
- update README docs on setting up the 2 python envs mentioned above
- adapt requirements.txt dependency versions to support python3.7
- explicitly set protocol 4 in both generate_anchor_dictionary and generate_wdproperties_spark scripts that were still using pickle protocol 5 via HIGHEST_PROTOCOL.
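The pickle protocol fix in the list above can be sketched with the standard library alone. Protocol 5 was introduced in Python 3.8, so data written with HIGHEST_PROTOCOL under python3.10 cannot be read under python3.7, while protocol 4 is readable by both. The helper names below are hypothetical; sqlitedict also documents encode/decode constructor arguments, but wiring these helpers into it is not shown and would be an assumption.

```python
import pickle

# Protocol 4 is readable by Python 3.4+, including the python3.7 env.
PICKLE_PROTOCOL = 4


def dump_compatible(obj) -> bytes:
    """Pickle with an explicit protocol instead of HIGHEST_PROTOCOL."""
    return pickle.dumps(obj, protocol=PICKLE_PROTOCOL)


def pickle_protocol_of(data: bytes) -> int:
    """Report which protocol a pickle was written with.

    Pickles written with protocol 2 or later start with the PROTO
    opcode (0x80) followed by one byte holding the protocol number.
    """
    if data[:1] != b"\x80":
        raise ValueError("pickle has no PROTO header (protocol < 2)")
    return data[1]
```

Checking pickle_protocol_of on a stored blob is a quick way to confirm which environment wrote it.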
Aug 29 2023
Following T345091#9124589, we learned that the sqlitedict package was using pickle protocol 5 via HIGHEST_PROTOCOL within the python3.10 environment. To avoid running into the unsupported pickle protocol: 5 issue, we are going to run all scripts that rely on this package within the python3.7 environment.
Great. Thank you @Urbanecm_WMF and @MGerlach for resolving this while I was away. Sorry for the inconvenience.
Aug 28 2023
We adapted the training pipeline:
- spark jobs run within the conda env (python3.10)
- spark commands distribute conda env as an archive to the jobs
- python3.7 env is used to train wikipedia2vec and filter its output
Aug 25 2023
The next step was to adapt the entire training pipeline (i.e. run-pipeline.sh) to use a packed env. Running it threw the error below:
$ WIKI_ID="simplewiki" ./run-pipeline.sh
. . .
RUNNING wikipedia2vec on dump
Traceback (most recent call last):
  File "/home/kevinbazira/.conda/envs/link-recommendation-env/bin/wikipedia2vec", line 5, in <module>
    from wikipedia2vec.cli import cli
  File "/home/kevinbazira/.conda/envs/link-recommendation-env/lib/python3.10/site-packages/wikipedia2vec/__init__.py", line 4, in <module>
    from .dictionary import Dictionary
  File "wikipedia2vec/dictionary.pyx", line 1, in init wikipedia2vec.dictionary
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
@MGerlach and I looked into why the spark jobs in the training pipeline weren't running and discovered that they weren't able to run python even when it existed. The error below was being thrown:
(an-worker1092.eqiad.wmnet executor 1): java.io.IOException: Cannot run program "/home/kevinbazira/.conda/envs/adda-link-env/bin/python3": error=2, No such file or directory
Thank you for the clarification @nshahquinn-wmf.
Aug 24 2023
Stepped through the execution of the training pipeline and discovered that generate_anchor_dictionary_spark.py does not generate the expected files. This is likely to be caused by a spark session that no longer connects even when kerberos credentials are enabled on stat1008.
Aug 23 2023
Aug 22 2023
Thank you @Sgs and @Urbanecm_WMF. aswiki has been removed from the published datasets repo and wikis.txt.
Aug 21 2023
ganwiki and krcwiki datasets have been removed from the published datasets repo.
Aug 18 2023
Thank you for the follow-up, @Sgs. We are going to go ahead and remove ganwiki and krcwiki as we wait for aswiki.
Aug 16 2023
Aug 11 2023
After encountering issues with CI fetching files from the Wikimedia public datasets archive (T341582) and a CI post-merge build failure due to an internal server error (T342084), the recommendation-api image has finally been published to the Docker registry: https://docker-registry.wikimedia.org/wikimedia/research-recommendation-api/tags/. The next step is to deploy it onto LiftWing/k8s.
Aug 10 2023
Thanks @elukey, indeed there was a delay for the docker-registry website to sync with the registry. After a couple of hours, the research-recommendation-api image is now visible at https://docker-registry.wikimedia.org/wikimedia/research-recommendation-api/tags/
WMF currently uses Thanos Swift for object storage, but there is a plan to migrate to MOSS, which may run Ceph. Ceph docs (1, 2) show that it supports the Swift API. This Swift client will be able to work for us on both Thanos Swift and MOSS Ceph.
Aug 9 2023
A Swift client has been added to the recommendation-api image to enable it to fetch embeddings from Swift.
To reduce image layer sizes, in T343576 we store and fetch the ~2.8GB recommendation-api embedding from Swift as recommended in T288198#9037109. This has enabled the post-merge build to succeed: https://integration.wikimedia.org/ci/job/trigger-recommendation-api-ng-pipeline-publish/4/console