
kevinbazira (Kevin Bazira, KBazira)
Software Engineer (Machine Learning)

User Details

User Since
Aug 3 2019, 6:58 AM (241 w, 2 d)
Availability
Available
IRC Nick
kevinbazira
LDAP User
Kevin Bazira
MediaWiki User
KBazira (WMF)

Recent Activity

Yesterday

kevinbazira added a comment to T360177: Support building and running of articletopic-outlink model-server via Makefile.

I'm running into the error below, which is caused by a missing events module. This module is used to generate and send a topic prediction event to EventGate. It turns out the module lives in python/events.py, and the model-server can't locate it because it is not being run as a Python module.

Traceback (most recent call last):
  File "/home/inference-services/outlink-topic-model/model-server/model.py", line 6, in <module>
    import events
ModuleNotFoundError: No module named 'events'
make[1]: *** [Makefile:97: run-server] Error 1
make[1]: Leaving directory '/home/inference-services'
make: *** [Makefile:76: articletopic-outlink] Error 2
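
A minimal sketch of the import behaviour at play (directory layout taken from the traceback paths; the sys.path workaround below is hypothetical, the actual fix is to run the server as a Python module):

# model.py does a top-level `import events`, but events.py lives in a
# sibling python/ directory, so running `python model.py` directly cannot
# resolve it:
import sys
sys.path.insert(0, "python")  # hypothetical workaround, for illustration only
import events  # with the path extended, this resolves to python/events.py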
Mon, Mar 18, 10:21 AM · Machine-Learning-Team

Fri, Mar 15

kevinbazira committed rMLIS5929349671c0: articletopic-outlink: load model path from environment variable (authored by kevinbazira).
articletopic-outlink: load model path from environment variable
Fri, Mar 15, 4:00 PM
kevinbazira added a comment to T360177: Support building and running of articletopic-outlink model-server via Makefile.

Currently, the method that loads a model has a hardcoded model path. When we set the path through an environment variable, the error below is thrown. To resolve this, we need to refactor the model server so that it accepts the model path through an environment variable, as other model servers do.

Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
Traceback (most recent call last):
  File "/home/inference-services/outlink-topic-model/model-server/model.py", line 104, in <module>
    model = OutlinksTopicModel("outlink-topic-model")
  File "/home/inference-services/outlink-topic-model/model-server/model.py", line 27, in __init__
    self.load()
  File "/home/inference-services/outlink-topic-model/model-server/model.py", line 45, in load
    self.model = fasttext.load_model("/mnt/models/model.bin")
  File "/home/inference-services/my_venv/lib/python3.9/site-packages/fasttext/FastText.py", line 441, in load_model
    return _FastText(model_path=path)
  File "/home/inference-services/my_venv/lib/python3.9/site-packages/fasttext/FastText.py", line 98, in __init__
    self.f.loadModel(model_path)
ValueError: /mnt/models/model.bin cannot be opened for loading!
make[1]: *** [Makefile:97: run-server] Error 1
make[1]: Leaving directory '/home/inference-services'
make: *** [Makefile:76: articletopic-outlink] Error 2
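
A minimal sketch of the refactor described above (using MODEL_PATH as the variable name is an assumption, mirroring other model servers):

import os
import fasttext

# Read the model path from an environment variable, falling back to the
# previously hardcoded location when it is unset.
model_path = os.environ.get("MODEL_PATH", "/mnt/models/model.bin")
model = fasttext.load_model(model_path)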
Fri, Mar 15, 3:18 PM · Machine-Learning-Team
kevinbazira added a comment to T360177: Support building and running of articletopic-outlink model-server via Makefile.

While building the articletopic-outlink model-server locally, the error below is thrown. We encountered a similar error in T357382#9536821 and resolved it by adding the wheel package to the requirements.txt before installing fasttext.

Collecting fasttext==0.9.2 (from -r outlink-topic-model/model-server/requirements.txt (line 30))
  Using cached fasttext-0.9.2.tar.gz (68 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
Fri, Mar 15, 9:12 AM · Machine-Learning-Team
kevinbazira created T360177: Support building and running of articletopic-outlink model-server via Makefile.
Fri, Mar 15, 9:07 AM · Machine-Learning-Team

Thu, Mar 14

kevinbazira committed rMLIS388d8f4bae58: RRLA: upgrade KI from v5 to v6 (authored by kevinbazira).
RRLA: upgrade KI from v5 to v6
Thu, Mar 14, 2:47 PM
kevinbazira added a comment to T358676: Host a logo detection model for Commons images.

Thank you for providing details about the logo detection project, @mfossati! The ML team is excited to explore hosting it on LiftWing.

Thu, Mar 14, 9:23 AM · Structured-Data-Backlog, Machine-Learning-Team

Wed, Mar 13

kevinbazira edited P58692 RRLA model-server: test envs to assess runtime performance between KI v5 vs KI v6.
Wed, Mar 13, 6:59 AM · Machine-Learning-Team

Mon, Mar 11

kevinbazira added a comment to T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server.

@MunizaA, we're happy to hear that the information provided was helpful. For more context, the preprocessing time for each payload was recorded after every request made in both KI v5 and v6 environments. This is why the average value for the Preprocess Runtime (s) column was calculated in the last row of the table.

Mon, Mar 11, 10:48 AM · Patch-For-Review, Machine-Learning-Team

Fri, Mar 8

kevinbazira added a comment to T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server.

I noticed that in KI v6, pydantic data models were added to the BaseRevision class in the knowledge-integrity schema. The get_revision method used in RRLA relies on the Revision class, which inherits from BaseRevision. Since RRLA uses this method in the preprocess step, I compared the runtime of the preprocess step for two model servers: one running KI v5 (commit: 026c11a7b3bdb6bd16ef8826bc23b782e8c4e8c8) and another running KI v6 (commit: c8de64b8766e10223eabed73dad1bb2ac68c6b03). Below are the results, showing runtimes for the sample inputs we use in RRLA load tests and the test envs in P58692:

Request Payload                       | Preprocess Runtime (s), KI v5 | Preprocess Runtime (s), KI v6
{"lang": "es", "rev_id": 144593484}   | 0.1010565758                  | 0.1377308369
{"lang": "de", "rev_id": 224199451}   | 0.09622120857                 | 0.1309299469
{"lang": "ru", "rev_id": 123744978}   | 0.1045227051                  | 0.1144728661
{"lang": "de", "rev_id": 224285471}   | 0.1131651402                  | 0.1167194843
{"lang": "en", "rev_id": 1096349097}  | 0.1421649456                  | 0.1646904945
{"lang": "pl", "rev_id": 67533865}    | 0.1243362427                  | 0.1153821945
{"lang": "en", "rev_id": 1096728668}  | 0.1668889523                  | 0.169686079
{"lang": "en", "rev_id": 1096851393}  | 0.122885704                   | 0.1490731239
{"lang": "pl", "rev_id": 67538140}    | 0.106388092                   | 0.1116890907
{"lang": "en", "rev_id": 1096609909}  | 0.1272881031                  | 0.1196594238
{"lang": "es", "rev_id": 144616722}   | 0.1168558598                  | 0.1163015366
{"lang": "uk", "rev_id": 36418681}    | 0.1185092926                  | 0.1388361454
{"lang": "ru", "rev_id": 123727072}   | 0.1143059731                  | 0.141433239
{"lang": "en", "rev_id": 1096855066}  | 0.1432712078                  | 0.1390919685
{"lang": "ru", "rev_id": 123758382}   | 0.1263678074                  | 0.1209347248
Average                               | 0.1216151873                  | 0.132442077
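
For context, each runtime above was recorded around the preprocess call itself; a sketch of the measurement (function and variable names are illustrative):

import time

def timed_preprocess(model, payload):
    # Time the RRLA preprocess step for one request; model and payload
    # stand in for the real server objects.
    start = time.perf_counter()
    features = model.preprocess(payload)
    print(f"Preprocess Runtime (s): {time.perf_counter() - start:.10f}")
    return features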
Fri, Mar 8, 4:23 PM · Patch-For-Review, Machine-Learning-Team
kevinbazira created P58692 RRLA model-server: test envs to assess runtime performance between KI v5 vs KI v6.
Fri, Mar 8, 3:58 PM · Machine-Learning-Team

Thu, Mar 7

kevinbazira added a comment to T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server.

Thanks @isarantopoulos, earlier on I was missing the python/kserve subdirectory. After changing:

kserve @ git+https://github.com/kserve/kserve.git@426fe21da0612ea6ef4a116b5114270313e02bbb

to

kserve @ git+https://github.com/kserve/kserve.git@426fe21da0612ea6ef4a116b5114270313e02bbb#egg=kserve&subdirectory=python/kserve

this pre-release commit installed successfully. I also confirmed the recently added fastapi and pydantic versions:

pip list | grep -E '(fastapi|pydantic)'
fastapi                   0.108.0
pydantic                  2.6.3
pydantic_core             2.16.3
Thu, Mar 7, 1:41 PM · Patch-For-Review, Machine-Learning-Team
kevinbazira added a comment to T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server.

Following the workflow we use to build LiftWing model-servers, which involves installing the pip dependencies listed in the requirements.txt file, I added the above pre-release commit to the RRLA requirements.txt file. When I ran pip install -r requirements.txt, the error below was thrown:

Collecting kserve@ git+https://github.com/kserve/kserve.git@426fe21da0612ea6ef4a116b5114270313e02bbb
  Cloning https://github.com/kserve/kserve.git (to revision 426fe21da0612ea6ef4a116b5114270313e02bbb) to /tmp/pip-install-uwt31r3m/kserve_7e7029202b4b49449c96d8b0f6a3185d
  Running command git clone -q https://github.com/kserve/kserve.git /tmp/pip-install-uwt31r3m/kserve_7e7029202b4b49449c96d8b0f6a3185d
  Running command git rev-parse -q --verify 'sha^426fe21da0612ea6ef4a116b5114270313e02bbb'
  Running command git fetch -q https://github.com/kserve/kserve.git 426fe21da0612ea6ef4a116b5114270313e02bbb
  Running command git checkout -q 426fe21da0612ea6ef4a116b5114270313e02bbb
    ERROR: Command errored out with exit status 1:
     command: /home/thevenv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-uwt31r3m/kserve_7e7029202b4b49449c96d8b0f6a3185d/setup.py'"'"'; __file__='"'"'/tmp/pip-install-uwt31r3m/kserve_7e7029202b4b49449c96d8b0f6a3185d/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-nd559lke
         cwd: /tmp/pip-install-uwt31r3m/kserve_7e7029202b4b49449c96d8b0f6a3185d/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/usr/lib/python3.9/tokenize.py", line 392, in open
        buffer = _builtin_open(filename, 'rb')
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-uwt31r3m/kserve_7e7029202b4b49449c96d8b0f6a3185d/setup.py'
    ----------------------------------------
WARNING: Discarding git+https://github.com/kserve/kserve.git@426fe21da0612ea6ef4a116b5114270313e02bbb. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement kserve (unavailable)
ERROR: No matching distribution found for kserve (unavailable)
Thu, Mar 7, 11:35 AM · Patch-For-Review, Machine-Learning-Team

Wed, Mar 6

kevinbazira added a comment to T343123: Migrate Machine-generated Article Descriptions from toolforge to liftwing..

@Seddon and @Isaac, the article-descriptions inference service is now live in LiftWing production. It can be accessed through:
1. External endpoint:

curl "https://api.wikimedia.org/service/lw/inference/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}'

2. Internal endpoint:

curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H  "Host: article-descriptions.article-descriptions.wikimedia.org"
Wed, Mar 6, 2:56 PM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Machine-Learning-Team

Fri, Mar 1

kevinbazira added a comment to T358842: Investigate why WikiGPT returns an Internal Server Error.

It looks like neither Chris's nor WMF's API keys currently have access to the latest OpenAI models. I ended up using an older model (gpt-3.5-turbo), and the search query now returns results as shown below:

[Screenshot: wiki-gpt.toolforge.org search results (wiki-gpt.toolforge.org_search.png, 1 MB)]
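
A sketch of the fallback (assuming the legacy openai-python 0.x API that the Python 3.7 app appears to use; the exact app.py code may differ):

import openai

search_query = "example search query"  # in app.py this comes from the /search request
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # older model that both API keys can access
    messages=[{"role": "user", "content": search_query}],
)
search_results = response["choices"][0]["message"]["content"]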

Fri, Mar 1, 3:45 PM · Machine-Learning-Team
kevinbazira added a comment to T358842: Investigate why WikiGPT returns an Internal Server Error.

Using the WMF OpenAI account, I created a wikigpt API key. When I used it in the application, the error below was thrown:

ERROR:root:The model `gpt-4` does not exist or you do not have access to it. Learn more: https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4.
ERROR:app:Exception on /search [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "app.py", line 119, in search
    search_query
  File "app.py", line 84, in queryWikiGPT
    search_results = response["choices"][0]["message"]["content"]
KeyError: 'message'
172.17.0.1 - - [01/Mar/2024 10:01:00] "POST /search HTTP/1.1" 500 -
INFO:werkzeug:172.17.0.1 - - [01/Mar/2024 10:01:00] "POST /search HTTP/1.1" 500 -
Fri, Mar 1, 12:00 PM · Machine-Learning-Team
kevinbazira added a comment to T358842: Investigate why WikiGPT returns an Internal Server Error.

I dug into the server logs and found that we are receiving a rate-limit error from the OpenAI API:

ERROR:root:You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.
ERROR:app:Exception on /search [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "app.py", line 119, in search
    search_query
  File "app.py", line 84, in queryWikiGPT
    search_results = response["choices"][0]["message"]["content"]
KeyError: 'message'
172.17.0.1 - - [01/Mar/2024 09:27:54] "POST /search HTTP/1.1" 500 -
INFO:werkzeug:172.17.0.1 - - [01/Mar/2024 09:27:54] "POST /search HTTP/1.1" 500 -
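
The KeyError itself is a secondary symptom: queryWikiGPT indexes response["choices"][0]["message"]["content"] without checking whether the API returned an error payload instead of choices. A guarded version (a sketch, not the actual app.py code) would surface the underlying quota error directly:

def extract_search_results(response):
    # Return the completion text, or raise a clear error when the OpenAI
    # API responded with an error payload instead of choices.
    choices = response.get("choices")
    if not choices:
        raise RuntimeError("OpenAI API returned no choices: %s" % response.get("error"))
    return choices[0]["message"]["content"]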
Fri, Mar 1, 10:30 AM · Machine-Learning-Team
kevinbazira changed the status of T358842: Investigate why WikiGPT returns an Internal Server Error, a subtask of T328494: WikiGPT Experiment, from Open to In Progress.
Fri, Mar 1, 10:21 AM · Epic, Machine-Learning-Team
kevinbazira changed the status of T358842: Investigate why WikiGPT returns an Internal Server Error from Open to In Progress.
Fri, Mar 1, 10:21 AM · Machine-Learning-Team
kevinbazira created T358842: Investigate why WikiGPT returns an Internal Server Error.
Fri, Mar 1, 10:18 AM · Machine-Learning-Team

Thu, Feb 29

kevinbazira added a comment to T358742: Investigate InfServiceHighMemoryUsage for article-descriptions.

Based on this InfServiceHighMemoryUsage filter the alerts are triggered by both the article-descriptions-predictor-default-00006-deployment-79ffz6hsite pod in codfw and the article-descriptions-predictor-default-00005-deployment-64gglffsite pod in eqiad.

Thu, Feb 29, 8:18 AM · Machine-Learning-Team

Wed, Feb 28

kevinbazira created T358655: Set SLO for the article-descriptions isvc hosted on LiftWing.
Wed, Feb 28, 11:33 AM · Machine-Learning-Team
kevinbazira created T358654: Create external endpoint for article-descriptions isvc hosted on LiftWing.
Wed, Feb 28, 11:25 AM · Patch-For-Review, Machine-Learning-Team
kevinbazira added a comment to T358467: Move the article-descriptions model server from staging to production.

The article-descriptions model server was firing InfServiceHighMemoryUsage alerts. This usually happens when an isvc uses >90% of its memory limit for 5 minutes. I have increased the memory limit for this model server from 4Gi to 5Gi so that prod can handle more isvc requests without running out of memory.

Wed, Feb 28, 11:09 AM · Machine-Learning-Team

Tue, Feb 27

kevinbazira created P57998 RRML isvc TextClassificationPipeline error.
Tue, Feb 27, 1:14 PM · Machine-Learning-Team
kevinbazira added a comment to T358467: Move the article-descriptions model server from staging to production.

@klausman helped increase the caps on this model server's resource constraints. I pushed a patch that increased the number of CPUs used by the article-descriptions model server from 6 to 16 so that prod can match staging performance. The previous request we tested in T358467#9579190 has dropped from >8s to <3s:

$ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H  "Host: article-descriptions.article-descriptions.wikimedia.org"
{"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.043555498123168945,"mwapi - first paragraphs (s)":0.22956347465515137,"total network (s)":0.2606837749481201,"model (s)":2.4616105556488037,"total (s)":2.7223129272460938},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}
real	0m2.744s
user	0m0.000s
sys	0m0.013s
Tue, Feb 27, 10:49 AM · Machine-Learning-Team
kevinbazira added a comment to T358467: Move the article-descriptions model server from staging to production.

Thanks @klausman. As discussed yesterday, with the current configuration, a request that took <3s on staging now takes >8s in prod, as shown below:

$ time curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H  "Host: article-descriptions.article-descriptions.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.07287430763244629,"mwapi - first paragraphs (s)":0.27934741973876953,"total network (s)":0.3130209445953369,"model (s)":7.659532070159912,"total (s)":7.9725823402404785},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}
real	0m8.049s
user	0m0.014s
sys	0m0.001s
Tue, Feb 27, 7:23 AM · Machine-Learning-Team

Mon, Feb 26

kevinbazira updated subscribers of T358467: Move the article-descriptions model server from staging to production.

After @klausman helped add secrets, deploy configs, and certs, we are now getting this error:

$ helmfile -e ml-serve-eqiad diff
skipping missing values file matching "values-ml-serve-eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Comparing release=service-secrets, chart=wmf-stable/secrets
Comparing release=main, chart=wmf-stable/kserve-inference
in ./helmfile.yaml: 2 errors:
err 0: command "/usr/bin/helm3" exited with non-zero status:
Mon, Feb 26, 2:18 PM · Machine-Learning-Team
kevinbazira added a comment to T358467: Move the article-descriptions model server from staging to production.

Before deploying the article-descriptions model server in prod, I tried running helmfile -e ml-serve-* diff for both *eqiad and *codfw and got the error below:

$ helmfile -e ml-serve-eqiad diff
skipping missing values file matching "/etc/helmfile-defaults/private/ml-serve_services/article-descriptions/ml-serve-eqiad.yaml"
skipping missing values file matching "/etc/helmfile-defaults/private/ml-serve_services/article-descriptions/ml-serve-eqiad.yaml"
skipping missing values file matching "values-ml-serve-eqiad.yaml"
skipping missing values file matching "values-main.yaml"
Comparing release=service-secrets, chart=wmf-stable/secrets
Comparing release=main, chart=wmf-stable/kserve-inference
in ./helmfile.yaml: 2 errors:
err 0: command "/usr/bin/helm3" exited with non-zero status:
Mon, Feb 26, 12:42 PM · Machine-Learning-Team
kevinbazira moved T358467: Move the article-descriptions model server from staging to production from Unsorted to In Progress on the Machine-Learning-Team board.
Mon, Feb 26, 8:10 AM · Machine-Learning-Team
kevinbazira changed the status of T358467: Move the article-descriptions model server from staging to production from Open to In Progress.
Mon, Feb 26, 8:10 AM · Machine-Learning-Team
kevinbazira changed the status of T358467: Move the article-descriptions model server from staging to production, a subtask of T343123: Migrate Machine-generated Article Descriptions from toolforge to liftwing., from Open to In Progress.
Mon, Feb 26, 8:09 AM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Machine-Learning-Team
kevinbazira created T358467: Move the article-descriptions model server from staging to production.
Mon, Feb 26, 8:08 AM · Machine-Learning-Team

Fri, Feb 23

kevinbazira moved T357913: Support building and running of readability model-server via Makefile from In Progress to Done on the Machine-Learning-Team board.
Fri, Feb 23, 8:20 AM · Machine-Learning-Team
kevinbazira closed T357913: Support building and running of readability model-server via Makefile, a subtask of T352689: Add a script for running the Revert Risk model server locally, as Resolved.
Fri, Feb 23, 8:18 AM · Machine-Learning-Team
kevinbazira closed T357913: Support building and running of readability model-server via Makefile as Resolved.

Support for building the readability model-server using the Makefile was added and it can be tested using:

# first terminal
$ make readability
# second terminal
$ curl localhost:8080/v1/models/readability:predict -X POST -d '{"rev_id": 123456, "lang": "en"}' -H "Content-type: application/json"
$ MODEL_TYPE=readability make clean
Fri, Feb 23, 8:18 AM · Machine-Learning-Team

Thu, Feb 22

kevinbazira added a comment to P57453 article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing.

@Isaac, you're spot on! The main difference to note is that when we're running the model server on LiftWing, it accesses the REST endpoint via the Rest Gateway using http://rest-gateway.discovery.wmnet:4111/{lang}.wikipedia.org/v1/page/summary/{title}. However, when we're running the model server locally, it accesses the REST endpoint via the Wikimedia REST API using https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}.

Thu, Feb 22, 11:47 AM · Machine-Learning-Team
kevinbazira added a comment to T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.

I ran the second API call using the same LocalServer on the ml-sandbox (as used in P57453#232415). Below are the results:

$ time curl https://es.wikipedia.org/api/rest_v1/page/summary/Madrid
{"type":"standard","title":"Madrid","displaytitle":"<span class=\"mw-page-title-main\">Madrid</span>","namespace":{"id":0,"text":""},"wikibase_item":"Q2807","titles":{"canonical":"Madrid","normalized":"Madrid","display":"<span class=\"mw-page-title-main\">Madrid</span>"},"pageid":1791,"thumbnail":{"source":"https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Bandera_de_la_ciudad_de_Madrid.svg/langes-320px-Bandera_de_la_ciudad_de_Madrid.svg.png","width":320,"height":213},"originalimage":{"source":"https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Bandera_de_la_ciudad_de_Madrid.svg/langes-1500px-Bandera_de_la_ciudad_de_Madrid.svg.png","width":1500,"height":1000},"lang":"es","dir":"ltr","revision":"158343028","tid":"8ade1a10-d0ba-11ee-b43f-d31ec09a3bed","timestamp":"2024-02-21T13:09:02Z","description":"capital y municipio más poblado de España","description_source":"central","coordinates":{"lat":40.41694444,"lon":-3.70333333},"content_urls":{"desktop":{"page":"https://es.wikipedia.org/wiki/Madrid","revisions":"https://es.wikipedia.org/wiki/Madrid?action=history","edit":"https://es.wikipedia.org/wiki/Madrid?action=edit","talk":"https://es.wikipedia.org/wiki/Discusi%C3%B3n:Madrid"},"mobile":{"page":"https://es.m.wikipedia.org/wiki/Madrid","revisions":"https://es.m.wikipedia.org/wiki/Special:History/Madrid","edit":"https://es.m.wikipedia.org/wiki/Madrid?action=edit","talk":"https://es.m.wikipedia.org/wiki/Discusi%C3%B3n:Madrid"}},"extract":"Madrid es un municipio y una ciudad de España. La localidad, con categoría histórica de villa, es la capital del Estado y de la Comunidad de Madrid. En su término municipal, el más poblado de España, están empadronadas 3 280 782 personas, constituyéndose como la segunda ciudad más poblada de la Unión Europea, así como su área metropolitana, con 6 779 888 habitantes empadronados.","extract_html":"<p><b>Madrid</b> es un municipio y una ciudad de España. La localidad, con categoría histórica de villa, es la capital del Estado y de la Comunidad de Madrid. En su término municipal, el más poblado de España, están empadronadas <span>3 280 782 personas</span>, constituyéndose como la segunda ciudad más poblada de la Unión Europea, así como su área metropolitana, con <span>6 779 888 habitantes</span> empadronados.</p>"}
real	0m0.047s
user	0m0.025s
sys	0m0.017s
Thu, Feb 22, 11:12 AM · Wikipedia-Android-App-Backlog, Machine-Learning-Team
kevinbazira updated subscribers of T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.

In yesterday's meeting IIRC @klausman mentioned testing direct API calls to confirm whether the Rest Gateway endpoint used by LiftWing is slower. Here are some API calls we could use to test this:
1. Run within the article-descriptions model server hosted on LiftWing in the experimental namespace.

time curl http://rest-gateway.discovery.wmnet:4111/es.wikipedia.org/v1/page/summary/Madrid

2. Run within the article-descriptions model server hosted outside LiftWing.

time curl https://es.wikipedia.org/api/rest_v1/page/summary/Madrid
Thu, Feb 22, 10:55 AM · Wikipedia-Android-App-Backlog, Machine-Learning-Team

Wed, Feb 21

kevinbazira added a comment to P57453 article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing.

To further understand the latency on LiftWing, I looked at backends and p0.99 in this grafana dashboard: https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?from=now-3h&orgId=1&to=now&var-backend=rest-gateway.discovery.wmnet&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=experimental&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&var-response_code=All

Wed, Feb 21, 3:18 PM · Machine-Learning-Team
kevinbazira updated subscribers of P57453 article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing.

After having a chat with @isarantopoulos on IRC, I ran a comparison between the preprocess() and predict() methods. I used the sample inputs in P54507 to determine if there was a discrepancy between LiftWing and a LocalServer running on the ml-sandbox.

Wed, Feb 21, 12:32 PM · Machine-Learning-Team
kevinbazira added a comment to P57453 article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing.

According to T343123#9520331, the bottleneck is the preprocess step. However, the formatted JSON response shown above and the code profiling we did in T353127#9399942 indicate that the bottleneck is the predict step.

Wed, Feb 21, 8:24 AM · Machine-Learning-Team
kevinbazira updated subscribers of P57453 article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing.

The preprocess() method calculates its runtime using execution_times["total network (s)"] as shown here. Based on the formatted JSON response shown above, the preprocess runtime is about 0.4s. These results match @Isaac's comment in T343123#9527462.

Wed, Feb 21, 8:23 AM · Machine-Learning-Team
kevinbazira created P57453 article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing.
Wed, Feb 21, 8:15 AM · Machine-Learning-Team

Tue, Feb 20

kevinbazira committed rMLISe5f2f883769a: readability: refactor model server to run as python module (authored by kevinbazira).
readability: refactor model server to run as python module
Tue, Feb 20, 5:29 PM
kevinbazira added a comment to T357913: Support building and running of readability model-server via Makefile.

In today's meeting, the team discussed T357913#9558911 and suggested renaming the readability model server parent directory to readability_model as this is the same pattern used for revscoring_model.

Tue, Feb 20, 3:36 PM · Machine-Learning-Team
kevinbazira updated subscribers of T357913: Support building and running of readability model-server via Makefile.

@isarantopoulos and I had a chat on IRC and agreed that refactoring the readability model server to run as a Python module would be beneficial. This would help standardize integration with other tools (such as the Makefile) and improve maintainability. As I began looking into the refactoring, I noticed that the readability model server already imports a readability module (here and here). This creates a conflict: once the readability model server runs as a Python module, Python resolves the local readability module instead of the installed one and cannot find the classify() and load_model() methods. To resolve this, we may need to rename either the readability model server directory or the readability module that we import into the server.
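
A quick way to see which readability module Python actually resolves (a diagnostic sketch):

import readability

# If the local model-server directory shadows the installed package, this
# prints the local path instead of site-packages, and the model API is missing:
print(readability.__file__)
print(hasattr(readability, "classify"), hasattr(readability, "load_model"))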

Tue, Feb 20, 1:31 PM · Machine-Learning-Team
kevinbazira committed rMLISe16ab31d853f: Makefile: add support for readability (authored by kevinbazira).
Makefile: add support for readability
Tue, Feb 20, 11:21 AM

Mon, Feb 19

kevinbazira committed rMLIS24017c926840: readability: make model path configurable (authored by kevinbazira).
readability: make model path configurable
Mon, Feb 19, 5:54 PM
kevinbazira moved T357913: Support building and running of readability model-server via Makefile from Unsorted to In Progress on the Machine-Learning-Team board.
Mon, Feb 19, 3:36 PM · Machine-Learning-Team
kevinbazira changed the status of T357913: Support building and running of readability model-server via Makefile from Open to In Progress.
Mon, Feb 19, 3:35 PM · Machine-Learning-Team
kevinbazira changed the status of T357913: Support building and running of readability model-server via Makefile, a subtask of T352689: Add a script for running the Revert Risk model server locally, from Open to In Progress.
Mon, Feb 19, 3:35 PM · Machine-Learning-Team
kevinbazira created T357913: Support building and running of readability model-server via Makefile.
Mon, Feb 19, 3:32 PM · Machine-Learning-Team

Feb 16 2024

kevinbazira moved T357382: Support building and running of langid model-server via Makefile from In Progress to Done on the Machine-Learning-Team board.
Feb 16 2024, 5:08 PM · Machine-Learning-Team
kevinbazira closed T357382: Support building and running of langid model-server via Makefile as Resolved.

+1 on adding a note to the model card. Support for building the langid model-server using the Makefile was added and it can be tested using:

# first terminal
$ make language-identification
# second terminal
$ curl localhost:8080/v1/models/langid:predict -i -X POST -d '{"text": "Some random text in any language"}'
$ MODEL_TYPE=langid make clean
Feb 16 2024, 5:07 PM · Machine-Learning-Team
kevinbazira closed T357382: Support building and running of langid model-server via Makefile, a subtask of T352689: Add a script for running the Revert Risk model server locally, as Resolved.
Feb 16 2024, 5:07 PM · Machine-Learning-Team
kevinbazira added a comment to T357382: Support building and running of langid model-server via Makefile.

While testing the locally-built langid model-server, I queried the inference service and received some interesting results. I tested three languages (English, French, and Swahili) and found that the isvc struggled to predict English accurately when the input was a short sentence with only about four words. Here are the results of my tests:

Feb 16 2024, 7:48 AM · Machine-Learning-Team
kevinbazira added a comment to T357217: research/mwaddlink has failing CI on the main branch.

Hi @hashar, in this blubber file, we have been trying to copy files from the test to the codehealth variant but keep running into the error below:

#6 local://context
#6 sha256:177a22c08429ba9a74acccc3451f709da925e7f932e391246aa7c960ee0b7b7d
#6 DONE 0.0s
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: rpc error: code = Unknown desc = failed to compile to LLB state: preparation: pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed

Here is the full build log: https://integration.wikimedia.org/ci/job/research-mwaddlink-pipeline-test/1023/execution/node/86/log/

Feb 16 2024, 6:48 AM · Growth-Team (Sprint 7 (Growth Team)), ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team, Add-Link, Continuous-Integration-Config

Feb 15 2024

kevinbazira added a comment to T357217: research/mwaddlink has failing CI on the main branch.

Hi @Urbanecm_WMF, you and I share the same concerns about model training matching the model-serving requirements, as I wrote in T357217#9534633.

Feb 15 2024, 12:38 PM · Growth-Team (Sprint 7 (Growth Team)), ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team, Add-Link, Continuous-Integration-Config

Feb 14 2024

kevinbazira added a comment to T357217: research/mwaddlink has failing CI on the main branch.

Finally got the test build to succeed. @kostajh and @Urbanecm_WMF please review whenever you get a minute: https://gerrit.wikimedia.org/r/1001958. Thanks!

Feb 14 2024, 2:40 PM · Growth-Team (Sprint 7 (Growth Team)), ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team, Add-Link, Continuous-Integration-Config
kevinbazira added a comment to T357217: research/mwaddlink has failing CI on the main branch.

Following T357217#9534633, I had a chat with @MGerlach and option 1 is the most feasible at the moment.

Feb 14 2024, 12:52 PM · Growth-Team (Sprint 7 (Growth Team)), ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team, Add-Link, Continuous-Integration-Config

Feb 13 2024

kevinbazira committed rMLISf31de494be56: langid: fix pybind11 missing issue (authored by kevinbazira).
langid: fix pybind11 missing issue
Feb 13 2024, 2:54 PM
kevinbazira added a comment to T357382: Support building and running of langid model-server via Makefile.

The error above has been fixed by installing the wheel package before installing fasttext. The langid requirements.txt that I used has:

kserve==0.11.2
wheel==0.42.0
fasttext==0.9.2
Feb 13 2024, 9:19 AM · Machine-Learning-Team
kevinbazira added a comment to T357382: Support building and running of langid model-server via Makefile.

Trying to build the langid model-server locally throws the error below. This seems to happen when pip installs fasttext==0.9.2 and cannot find the pybind11 package.

Collecting fasttext==0.9.2 (from -r langid/././requirements.txt (line 2))
  Downloading fasttext-0.9.2.tar.gz (68 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68.8/68.8 kB 3.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
Feb 13 2024, 9:14 AM · Machine-Learning-Team
kevinbazira changed the status of T357382: Support building and running of langid model-server via Makefile, a subtask of T352689: Add a script for running the Revert Risk model server locally, from Open to In Progress.
Feb 13 2024, 9:02 AM · Machine-Learning-Team
kevinbazira changed the status of T357382: Support building and running of langid model-server via Makefile from Open to In Progress.
Feb 13 2024, 9:02 AM · Machine-Learning-Team
kevinbazira created T357382: Support building and running of langid model-server via Makefile.
Feb 13 2024, 9:00 AM · Machine-Learning-Team

Feb 12 2024

kevinbazira committed rMLIS658700365fd8: Makefile: add usage steps to RRLA README (authored by kevinbazira).
Makefile: add usage steps to RRLA README
Feb 12 2024, 5:23 PM
kevinbazira added a comment to T357217: research/mwaddlink has failing CI on the main branch.
  1. Update to debian bullseye to get python 3.9
     - noting that stat1008 has python3.7 by default, not sure if python3.9 is available there

I checked stat1008 and python3.9 is not available as shown below:

kevinbazira@stat1008:~/fix-add-a-link-CI-deps/mwaddlink$ virtualenv -p python3.9 venv
The path python3.9 (from --python=python3.9) does not exist

The model training pipeline runs on stat1008; without python3.9, this process will be blocked.

Feb 12 2024, 5:06 PM · Growth-Team (Sprint 7 (Growth Team)), ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team, Add-Link, Continuous-Integration-Config
kevinbazira committed rMLISc0ad1676c4e7: Makefile: add usage steps to article-descriptions README (authored by kevinbazira).
Makefile: add usage steps to article-descriptions README
Feb 12 2024, 10:44 AM

Feb 9 2024

kevinbazira moved T356985: Maintain models directory structure for model-server make builds to remain consistent with the analytics repo from Unsorted to Done on the Machine-Learning-Team board.
Feb 9 2024, 7:33 AM · Machine-Learning-Team
kevinbazira closed T356985: Maintain models directory structure for model-server make builds to remain consistent with the analytics repo, a subtask of T352689: Add a script for running the Revert Risk model server locally, as Resolved.
Feb 9 2024, 7:31 AM · Machine-Learning-Team
kevinbazira closed T356985: Maintain models directory structure for model-server make builds to remain consistent with the analytics repo as Resolved.

Configurations were added to the Makefile, and it now maintains the models directory structure for model-server make builds, keeping them consistent with the analytics repo. Below are examples of how we tested this:
1. RRLA

# first terminal
$ make revertrisk-language-agnostic
# second terminal
$ tree /models/
/models/
└── revertrisk
    └── language-agnostic
        └── 20221026144108
            └── model.pkl
$ MODEL_TYPE=revertrisk make clean
Feb 9 2024, 7:31 AM · Machine-Learning-Team

Feb 8 2024

kevinbazira committed rMLIS95ab657a1d85: Makefile: maintain models directory structure as analytics repo (authored by kevinbazira).
Makefile: maintain models directory structure as analytics repo
Feb 8 2024, 4:49 PM
kevinbazira changed the status of T356985: Maintain models directory structure for model-server make builds to remain consistent with the analytics repo, a subtask of T352689: Add a script for running the Revert Risk model server locally, from Open to In Progress.
Feb 8 2024, 11:36 AM · Machine-Learning-Team
kevinbazira changed the status of T356985: Maintain models directory structure for model-server make builds to remain consistent with the analytics repo from Open to In Progress.
Feb 8 2024, 11:36 AM · Machine-Learning-Team
kevinbazira triaged T356985: Maintain models directory structure for model-server make builds to remain consistent with the analytics repo as Medium priority.
Feb 8 2024, 11:35 AM · Machine-Learning-Team
kevinbazira created T356985: Maintain models directory structure for model-server make builds to remain consistent with the analytics repo.
Feb 8 2024, 11:33 AM · Machine-Learning-Team
kevinbazira moved T356176: Support building and running of article-descriptions model-server via Makefile from In Progress to Done on the Machine-Learning-Team board.
Feb 8 2024, 11:27 AM · Machine-Learning-Team
kevinbazira closed T356176: Support building and running of article-descriptions model-server via Makefile, a subtask of T352689: Add a script for running the Revert Risk model server locally, as Resolved.
Feb 8 2024, 11:25 AM · Machine-Learning-Team
kevinbazira closed T356176: Support building and running of article-descriptions model-server via Makefile as Resolved.

Support for building the article-descriptions model-server using the Makefile was added and it can be tested using:

# first terminal
$ make article-descriptions
# second terminal
$ curl localhost:8080/v1/models/article-descriptions:predict -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 3}' -H "Content-Type: application/json" --http1.1
$ MODEL_SERVER_PARENT_DIR=article_descriptions make clean
Feb 8 2024, 11:25 AM · Machine-Learning-Team
kevinbazira added a comment to P56336 stack traces for issues experienced when testing make builds for the RR model-servers on linux.

Running the build commands today succeeded for 1 of 2, as shown below. The previous model-download issue was likely caused by a rate cap on the analytics public repository website, as discussed in the meeting.

Feb 8 2024, 6:25 AM · Machine-Learning-Team

Feb 6 2024

kevinbazira committed rMLIS0987acb8f2ab: Makefile: add support for article-descriptions (authored by kevinbazira).
Makefile: add support for article-descriptions
Feb 6 2024, 4:34 PM
kevinbazira created P56336 stack traces for issues experienced when testing make builds for the RR model-servers on linux.
Feb 6 2024, 12:52 PM · Machine-Learning-Team
kevinbazira added a comment to P55971 makefile error.

We did not face the issue of sentencepiece failing to install on Linux. This is probably because the cmake and pkg-config packages are typically available on most Linux systems. The link below shows a macOS-specific solution that resolves this issue:
https://github.com/google/sentencepiece/issues/378#issuecomment-969896519

Feb 6 2024, 8:43 AM

Feb 5 2024

kevinbazira added a comment to P56202 (An Untitled Masterwork).

I have not encountered this on Linux with Python 3.9.2. Are you using Python 3.11? KServe requires the ray package >=2.4.0,<2.5.0, and based on https://pypi.org/project/ray/2.4.0/, that version of ray supports Python 3.6 to 3.10.
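
A quick check for the mismatch (a sketch based on the version support noted above):

import sys

# ray 2.4.x, pinned by kserve as >=2.4.0,<2.5.0, supports Python 3.6 to 3.10,
# so installs fail on Python 3.11.
assert sys.version_info[:2] <= (3, 10), "ray<2.5.0 does not support this Python version"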

Feb 5 2024, 5:14 PM

Jan 30 2024

kevinbazira moved T356176: Support building and running of article-descriptions model-server via Makefile from Unsorted to In Progress on the Machine-Learning-Team board.
Jan 30 2024, 1:12 PM · Machine-Learning-Team
kevinbazira triaged T356176: Support building and running of article-descriptions model-server via Makefile as Medium priority.
Jan 30 2024, 1:11 PM · Machine-Learning-Team
kevinbazira created T356176: Support building and running of article-descriptions model-server via Makefile.
Jan 30 2024, 1:09 PM · Machine-Learning-Team

Jan 29 2024

kevinbazira added a comment to T343123: Migrate Machine-generated Article Descriptions from toolforge to liftwing..

@Seddon, in T353127 we made significant improvements in response latency. For example, in T353127#9398823 a request initially had a 14s response time; with subsequent optimization efforts we reduced this to 4s, as seen in T353127#9421055. This reduction was achieved by increasing the CloudVPS instance CPU and memory resources beyond the standard allocation and by using CPU core pinning. Neither method affected the prediction quality.
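
For reference, CPU core pinning can be done from Python on Linux like this (illustrative only; the exact core set used in T353127 is not recorded here):

import os

# Pin the current process (pid 0 means "self") to a fixed set of cores so
# worker threads stop migrating between CPUs.
os.sched_setaffinity(0, {0, 1, 2, 3})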

Jan 29 2024, 8:55 AM · Wikipedia-Android-App-Backlog (Android Release - FY2023-24), Machine-Learning-Team

Jan 25 2024

kevinbazira added a comment to T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server.

I pinged Muniza about the possibility of loosening the knowledge-integrity constraint to allow for pydantic < 2.0.0 and here is her response:

On Slack, Muniza wrote:

... there are breaking code changes between pydantic v1 and v2 so it won't be possible to just loosen constraints. We'd need to downgrade pydantic in KI to v1 which would require some code changes but more importantly pydantic v1 is supposed to be considerably slower than v2 which might impact the latency of the models.

Jan 25 2024, 9:51 AM · Patch-For-Review, Machine-Learning-Team
kevinbazira added a comment to T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server.

Thank you for the suggestion @isarantopoulos. I tried fastapi==0.109.0 and ran into the error below; it looks like kserve 0.11.2 doesn't support it.

ERROR: Cannot install -r revert_risk_model/model_server/revertrisk/requirements.txt (line 3) and fastapi==0.109.0 because these package versions have conflicting dependencies.
Jan 25 2024, 7:41 AM · Patch-For-Review, Machine-Learning-Team

Jan 24 2024

kevinbazira claimed T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server.

I have been working on updating knowledge-integrity in the RRLA model-server. I tried running it locally and am currently getting dependency conflicts between the pydantic pulled in by kserve's fastapi and the pydantic required by knowledge-integrity, as shown below:

ERROR: Cannot install -r revert_risk_model/model_server/revertrisk/requirements.txt (line 1), knowledge-integrity[revertrisk]==0.6.0 and kserve because these package versions have conflicting dependencies.
Jan 24 2024, 4:35 PM · Patch-For-Review, Machine-Learning-Team
kevinbazira created T355742: Assess runtime performance impact of pydantic data models in the RRLA model-server.
Jan 24 2024, 8:07 AM · Patch-For-Review, Machine-Learning-Team

Jan 22 2024

kevinbazira added a comment to T351939: Document load test results.

Until now, this locust prototype has been using the same payload to run a load test on the article-descriptions model-server. Since in wrk we were using multiple payloads by reading an input file, I have updated article_descriptions.py to replicate this functionality using process_payload().
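
A sketch of the process_payload() approach (the file name and one-JSON-object-per-line format are assumptions; the full script is in P55186):

import json
import random

from locust import HttpUser, task

def process_payload(path="article_descriptions_payloads.txt"):
    # Load one JSON payload per line, mirroring the input file wrk consumed.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

PAYLOADS = process_payload()

class ArticleDescriptionsUser(HttpUser):
    @task
    def predict(self):
        # Each request picks a payload at random instead of reusing one.
        self.client.post(
            "/v1/models/article-descriptions:predict",
            json=random.choice(PAYLOADS),
        )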

Jan 22 2024, 11:21 AM · Machine-Learning-Team
kevinbazira created P55186 [locust] article descriptions isvc load test script with process_payload().
Jan 22 2024, 11:15 AM · Machine-Learning-Team

Jan 19 2024

kevinbazira added a comment to T351939: Document load test results.

In order to compare historical data from T351939#9469592, I updated article_descriptions.py with lw_stats_analysis() and changed lw_stats_history() to use pandas instead of the csv module. Below is what the comparison report looks like for a given lw_stats_history.csv.
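
A sketch of the pandas-based comparison (column names follow the lw_stats_history.csv layout shown in T351939#9469592; the real lw_stats_analysis() is in P55025):

import pandas as pd

def lw_stats_analysis(path="lw_stats_history.csv"):
    # Load every recorded run and report how key aggregate metrics moved
    # between consecutive load tests.
    df = pd.read_csv(path)
    deltas = df[["Average Response Time", "Requests/s"]].diff().add_suffix(" (delta)")
    return pd.concat([df, deltas], axis=1)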

Jan 19 2024, 1:39 PM · Machine-Learning-Team
kevinbazira created P55025 [locust] article descriptions isvc load test script with lw_stats_history() and lw_stats_analysis() using pandas.
Jan 19 2024, 1:14 PM · Machine-Learning-Team
kevinbazira updated the title for P54934 [locust] article descriptions isvc load test script with lw_stats_history() from [locust] article descriptions isvc load test script with lw_history_stats() to [locust] article descriptions isvc load test script with lw_stats_history().
Jan 19 2024, 6:02 AM · Machine-Learning-Team

Jan 18 2024

kevinbazira added a comment to T351939: Document load test results.

locust has a test_stop event listener that can be used to read article_descriptions_stats.csv after it has been generated at the end of a load test run. I have updated the article_descriptions.py file to utilize this event hook to extract the "Aggregated" data from article_descriptions_stats.csv (as shown in T351939#9468383) and save it to lw_stats_history.csv. Here is what the contents of lw_stats_history.csv look like after running 3 load tests:

Timestamp,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,66%,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%
20240118151948,20,0,3528,3618.5,3528,3709,167.0,0.22888733537397923,0.0,3700,3700,3700,3700,3700,3700,3700,3700,3700,3700,3700
20240118152039,20,0,3500,3567.0,3490,3644,167.0,0.2291656581522369,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600
20240118152356,20,0,3439,3772.0,3439,4105,167.0,0.2307884342326772,0.0,4100,4100,4100,4100,4100,4100,4100,4100,4100,4100,4100
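
A sketch of the test_stop hook described above (the actual implementation lives in article_descriptions.py; the column handling here is simplified):

import csv
import os
from datetime import datetime

from locust import events

@events.test_stop.add_listener
def on_test_stop(environment, **kwargs):
    # Pull the "Aggregated" row out of the per-run stats CSV that locust
    # writes, stamp it, and append it to the running history file.
    with open("article_descriptions_stats.csv") as f:
        aggregated = next(r for r in csv.DictReader(f) if r["Name"] == "Aggregated")
    row = {"Timestamp": datetime.now().strftime("%Y%m%d%H%M%S"), **aggregated}
    write_header = not os.path.exists("lw_stats_history.csv")
    with open("lw_stats_history.csv", "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)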

The next step will be working on running a comparative analysis on this data.

Jan 18 2024, 3:45 PM · Machine-Learning-Team