Yesterday
Isaac has answered the first and second questions. For the question about the drafttopic model: considering that the outlinks model performs worse on low-link/new articles, I think we can keep the drafttopic model for now (although it only supports enwiki).
Wed, Feb 1
A new model that works with transformers 4.25.1 and torch 1.13.1 has been uploaded:
(This was mainly due to joblib serialization specifics: the model had to be reloaded with the new transformers version and the model dump reserialized.)
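A minimal sketch of that reserialization step (the file paths below are illustrative, not the actual model paths):

```python
import joblib

# Load the old dump under the new transformers/torch environment, then
# write it back out so the pickle matches the new versions.
model = joblib.load("model.pkl")          # path is illustrative
joblib.dump(model, "model_reserialized.pkl")
```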
The task is done.
The problem is resolved.
Mon, Jan 30
Thu, Jan 12
Wed, Jan 11
In Knowledge Integrity's latest MR, we declared torch as a direct dependency for the multilingual model and added the PyTorch CPU index as a secondary source for Poetry. Therefore, we no longer need to install torch separately in requirements.txt.
In Knowledge Integrity's latest MR, we declared each model's dependencies as an extras group. This way, a model's dependencies are no longer tied to a specific KI version tag (0.1.0 for the language-agnostic model, 0.2.0 for the multilingual model); users can install the dependencies for the model they want by including its name in the installation command.
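To illustrate both changes, a hedged sketch of what the relevant pyproject.toml pieces might look like (the source name, extras name, and package name below are assumptions, not the actual KI config):

```toml
# Secondary package source so Poetry can resolve CPU-only torch wheels.
[[tool.poetry.source]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
secondary = true

[tool.poetry.dependencies]
# Optional dependency, pulled in only via the extras group below.
torch = { version = "1.13.1", source = "pytorch-cpu", optional = true }

[tool.poetry.extras]
multilingual = ["torch"]
```

With a layout like this, users would install per-model dependencies with something like `pip install "knowledge_integrity[multilingual]"` (the package name is illustrative).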
Fri, Jan 6
@isarantopoulos I saw you upgraded some dependencies for revscoring. One question is whether they are compatible with the current models, because the models were trained with the old environment and dependencies.
Dec 23 2022
Current status:
Dec 22 2022
Dec 21 2022
Dec 20 2022
We rebuilt our Docker image with Luca's knowledge integrity fork, which removed the torch dependency. When we tested the image on ml-sandbox, we got:
Message: None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
This confirmed that transformers needs torch as a dependency: it assumes you already have a deep learning library installed.
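A minimal sketch of what that warning implies (these helpers exist in transformers 4.x; the check itself is ours, not part of our image):

```python
# transformers degrades gracefully when no deep learning backend is present:
# the import succeeds, but models are unavailable at runtime.
from transformers.utils import is_tf_available, is_torch_available

if not (is_torch_available() or is_tf_available()):
    raise RuntimeError("No DL backend found; only tokenizers/configs will work")
```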
Dec 19 2022
Update:
- we fixed the issue regarding "missing" responses from the MW API and deployed a new image (2022-12-14-170742-publish, corresponding to knowledge integrity v0.1) to codfw/eqiad.
@elukey Changing model path is good, thanks for your help! :)
Dec 16 2022
Dec 15 2022
Dec 14 2022
The model has been uploaded to Thanos Swift:
```
aikochou@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/experimental/revertrisk/20221214175551/
2022-12-14 18:00    2647804395   s3://wmf-ml-models/experimental/revertrisk/20221214175551/model.pkl
```
The size is around 2.5G.
Dec 13 2022
The problem is that their hosts are not set correctly.
Dec 9 2022
The issue with missing responses in KServe was that we shared a client session between requests, so the Host header couldn't get updated correctly.
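A minimal sketch of the fix direction, assuming an aiohttp-style client (the function and its signature are illustrative, not the actual KServe code):

```python
import aiohttp

# One way to avoid the Host header staying pinned to the first wiki queried:
# use a fresh session per request instead of sharing one across requests.
async def query_mw_api(host: str, params: dict) -> dict:
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://{host}/w/api.php", params=params) as resp:
            return await resp.json()
```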
Dec 5 2022
@MunizaA your theory is right! The Host header is not getting updated; it stays stuck on the first host it uses. This time I got all the responses from ruwiki. I checked the successful responses for requests 67538140 (pl) and 36418681 (uk), and they are clearly from ruwiki. Also, the prediction results are totally different from those in the previous test.
I wrote a Lua script https://phabricator.wikimedia.org/P42235 that reads a file with multiple inputs and generates different requests as wrk executes. If you specify n threads, n log files will be created to log the responses per thread.
```
aikochou@ml-sandbox:~/isvcs/revertrisk$ cat input_10.tsv
ru	123855516
de	224199451
ru	123744978
ru	123855440
ru	123796333
en	1096720424
ru	123727072
en	1096855066
pl	67538140
uk	36418681
```
The poor performance I reported in my last comment was actually due to running on a MacBook with the M1 Max processor. I tested the model on ML-Sandbox (our dev cluster, 8 CPUs) and the inference time is much better than it was on the Mac.
Nov 28 2022
Nov 24 2022
The way we test batch prediction is to have a spark UDF like:
```python
from pyspark.sql.functions import udf
import requests

@udf
def getPrediction(body):
    inference_url = 'https://inference-staging.svc.codfw.wmnet:30443/v1/models/revert-risk-model:predict'
    headers = {
        'Host': 'revert-risk-model.experimental.wikimedia.org',
        'Content-Type': 'application/x-www-form-urlencoded',
    }
    resp = requests.post(inference_url, headers=headers, data=body)
    if resp.status_code == 200:
        return resp.text
    # A UDF must return a serializable value, not a Response object
    return None
```
and then use it on a dataframe:
```python
from pyspark.sql.functions import json_tuple

df_100 = df_100.select("*", getPrediction("data").alias("result"))
df_100 = df_100.withColumn("prediction", json_tuple("result", "score"))
pandas_df = df_100.toPandas()
```
I was able to run the model server locally with docker yesterday, but there are two issues worth noting.
Nov 22 2022
Nov 17 2022
Hi @Isaac the issue has been solved, but the task hasn't been updated. I'm doing it now. Thanks for the heads up.
The revertrisk model has been deployed to production. I'm going to mark this as RESOLVED. We'll open other tasks for new model deployment when needed.
@MunizaA The socket connection error seems to be caused by the open-file limit on Linux, according to the article. It is not a problem with the model server, so I don't think we need to worry too much. Also note that deploy1002 is not only the deployment server for ML services, but also for MediaWiki and all Wikimedia Kubernetes services, so wrk may hit the limit more easily.
This task has been done. I'm going to mark this as RESOLVED. We'll follow up on the DE team's new page state stream and open other tasks when needed.
The new outlinks topic model has been deployed to production. I'm going to mark this as RESOLVED.
Nov 14 2022
Nov 11 2022
Thanks for updating the model card! I've tested a model server with the new model locally and the model seems to be working great :)
Nov 8 2022
The new model size looks very suspicious; maybe you could check whether you uploaded the model properly.
```
aikochou@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/articlequality/fawiki/20221107044250/
2022-11-07 04:42         132   s3://wmf-ml-models/articlequality/fawiki/20221107044250/model.bin
aikochou@stat1004:~$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls s3://wmf-ml-models/articlequality/fawiki/20220509071250/
2022-05-09 07:12     3632132   s3://wmf-ml-models/articlequality/fawiki/20220509071250/model.bin
```
I tested outlink with benthos for around 9 hours the other day (here are the Grafana metrics) and observed that it returned ~1800 Bad Request errors with "No matching article or the page has no outlinks".
Nov 4 2022
Nov 3 2022
@kevinbazira thanks for working on this. I can see that the new model's quality predictions remain at the C level for both revisions, indicating that the model takes both ref tags and sfn templates into account. That's great! :) I'm a bit surprised that the predicted quality is not GA, but it might be because the new sfn features also affect other related features, like the proportion of references in the article, so never mind.
Nov 2 2022
Some load test results:
If there is an upstream issue, we don't get the right content type for JSON data, so it raises a ValueError of "Could not decode as JSON" (see line 88 and lines 101-107 in mwapi/async_session.py).
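A hedged sketch of the failure mode with an aiohttp-style session (the wrapper below is hypothetical, not the mwapi code):

```python
import aiohttp

async def fetch_json(session: aiohttp.ClientSession, url: str) -> dict:
    async with session.get(url) as resp:
        try:
            # resp.json() validates the Content-Type header before decoding,
            # so a non-JSON upstream response surfaces as an error here.
            return await resp.json()
        except aiohttp.ContentTypeError as e:
            raise ValueError("Could not decode as JSON") from e
```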
Nov 1 2022
Oct 31 2022
@diego I added a section https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Usage in our documentation about how to access inference services internally.
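For reference, a minimal example of such an internal call, reusing the staging endpoint and Host header that appear elsewhere in this feed (the payload shape is illustrative):

```python
import requests

resp = requests.post(
    "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revert-risk-model:predict",
    headers={"Host": "revert-risk-model.experimental.wikimedia.org"},
    json={"rev_id": 123855516, "lang": "ru"},  # illustrative payload
)
print(resp.status_code, resp.text)
```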
Something interesting I found when I ran benthos for the outlink isvc:
Oct 28 2022
The revert-risk model has been deployed to production today. :)
Oct 26 2022
The model has been uploaded to Thanos Swift:
Oct 25 2022
Current status:
The code changes for this task are complete. Going to mark this as RESOLVED. I'll open another task when we test large requests/bulk inference from Hadoop.
Oct 20 2022
The output change has been applied to all other revscoring model servers.
@Isaac Got it! For the moment I only aligned the event output with the existing ORES model. I think it's not super urgent to change the output for on-demand requests now and it doesn't seem very relevant to this task. If in the future we deem it necessary, I'll open a task for that. :)