User Details
- User Since
- Jun 17 2019, 4:51 PM (196 w, 4 d)
- Roles
- Disabled
- IRC Nick
- accraze
- LDAP User
- Unknown
- MediaWiki User
- ACraze (WMF) [ Global Accounts ]
Feb 28 2022
To keep archives happy: We had discussed potentially creating a base 'revscoring' isvc class that would include error handling. After looking into the DRY approach, it seems we may want to keep each class separate, as there are just enough differences that DRY would not be all that helpful in the long run (e.g. testing for specific revscoring errors, fetching text vs. feature values, model.bin vs. model.bz2, etc.).
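For the archives, a minimal sketch of the kind of divergence that made a shared base class unattractive (class names, paths and which service uses which format are illustrative, not the actual inference-services code):

```
import bz2

import kserve
from revscoring import Model


class EditqualityModel(kserve.Model):
    def load(self):
        # this service ships a plain model.bin
        with open("/mnt/models/model.bin", "rb") as f:
            self.model = Model.load(f)
        self.ready = True


class DraftqualityModel(kserve.Model):
    def load(self):
        # this service ships a bz2-compressed binary instead
        with bz2.open("/mnt/models/model.bz2", "rb") as f:
            self.model = Model.load(f)
        self.ready = True
```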
Feb 24 2022
To keep archives happy:
Feb 22 2022
After talking with @elukey and @kevinbazira on IRC, it seems that we may need to split up the revscoring-editquality namespace as some nodes are filling up with pods.
Feb 15 2022
The feature names are stored in the model instance and then we get the feature values back from the extractor, so we will need to combine them before including them in the response, something like:
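(Rough sketch only; the helper name and `feature_values` are placeholders for whatever the transformer actually returns.)

```
def combine_features(model, feature_values):
    # revscoring feature objects stringify to their names,
    # e.g. "feature.revision.user.is_anon"
    feature_names = [str(f) for f in model.features]
    return dict(zip(feature_names, feature_values))
```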
Feb 14 2022
I ran this inside the models/ dir in the editquality repo; just make sure to run git lfs pull first.
I just uploaded all of the editquality model files to storage on stat1008 (using this script: P20723).
Feb 9 2022
To recap from discussion at the team technical meeting today:
- revscoring transformers for editquality/draftquality/topics are very heavy, especially since they need to mount the model binary in the pod to extract features.
- articlequality does not need to load the model first to extract features, as it accepts raw article text (a rough sketch of the difference follows this list)
- for editquality, using transformers means 30+ additional pods
- there is no clear benefit to taking on this extra complexity, given the way revscoring works
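A rough sketch of that difference (hypothetical names; the real code lives in the inference-services repo):

```
import kserve


def extract_text_features(text):
    # Placeholder for articlequality's text-based feature extraction.
    return {"text_length": len(text)}


class ArticlequalityTransformer(kserve.Model):
    """articlequality can build features from the raw article text alone,
    so it never needs the model binary mounted in the transformer pod."""

    def preprocess(self, inputs):
        inputs["features"] = extract_text_features(inputs["article_text"])
        return inputs
```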
I managed to get an explainer attached to the enwiki-goodfaith Inference Service on ml-sandbox and was able to retrieve an explanation from the :explain endpoint.
Feb 7 2022
The deploy CR has been merged and we are scheduling a deployment on Wednesday morning; I will update here once it is complete.
I have cleared out & deleted the old s3 buckets. I have also added documentation for our dev model storage: https://wikitech.wikimedia.org/wiki/User:Accraze/MachineLearning/ML-Sandbox#Model_Storage
Confirming I was able to run the editquality transformer image on ml-sandbox last week:
Feb 4 2022
@kevinbazira - I took a look at your isvc spec, tried to deploy it and noticed that the Knative Revisions were failing.
Feb 3 2022
The articlequality PR has been merged and the repo has been mirrored to Gerrit again.
@kevinbazira - I believe model storage is now ready on ml-sandbox. Can you try these steps to see if you can upload a model binary to our minio object store?
I have installed a minio test instance on ml-sandbox and am able to use it for model storage. I have also configured s3cmd to use minio and can use our model_upload script.
Excellent, networking issues have been resolved and we can now run transformers on ml-sandbox.
Marking this as RESOLVED.
Feb 2 2022
@kevinbazira - can you try hitting enwiki-articlequality on the ml-sandbox to confirm the transformer routing works for you too? I have a test script in my home_dir if you want to use it:
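It is roughly along these lines (an illustrative sketch, not the exact script; the ingress address, Host header and payload are assumptions for the ml-sandbox setup):

```
import requests

# Assumed ml-sandbox setup: istio ingress reachable on localhost:8080
# (e.g. via a port-forward) and the isvc routed by its external Host header.
INGRESS_URL = "http://localhost:8080"
SERVICE_HOSTNAME = "enwiki-articlequality.kserve-test.example.com"

resp = requests.post(
    f"{INGRESS_URL}/v1/models/enwiki-articlequality:predict",
    json={"rev_id": 1083325118},
    headers={"Host": SERVICE_HOSTNAME},
)
print(resp.status_code)
print(resp.json())
```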
I think I've got the networking issue solved. The top-level isvc was unable to route to the transformer, because my cluster-local-gateway did not have the ports configured correctly in the Istio Operator.
Jan 31 2022
Repos have been mirrored and should be in sync again.
I've been reading the KServe docs and found an example of using minio for storage in a local cluster:
https://github.com/kserve/website/blob/main/docs/modelserving/kafka/kafka.md
Jan 28 2022
I've been rebuilding the sandbox cluster using the install script with the updated charts for knative and kserve. The KServe stack loads with all containers running fine; however, when I now deploy a new isvc (e.g. enwiki-articlequality) in a custom namespace like kserve-test, the images cannot be pulled from the WMF docker registry:
@kevinbazira - I noticed an issue with the new transformer. It seems that since we need to pass self.model.features to the extractor, we will need to load the model inside the transformer as well. The model binary should already be mounted at /mnt/models/model.bin, so you can load it similarly to how the predictor does.
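Something along these lines should work (just a sketch assuming the transformer mirrors the predictor's load(); class and argument names are illustrative):

```
import kserve
from revscoring import Model


class EditqualityTransformer(kserve.Model):
    """Illustrative sketch; the real transformer code lives in the
    inference-services repo."""

    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.predictor_host = predictor_host
        self.model = None
        self.ready = False

    def load(self):
        # The storage initializer mounts the binary at /mnt/models/model.bin,
        # so the transformer can load it the same way the predictor does and
        # then hand self.model.features to the extractor in preprocess().
        with open("/mnt/models/model.bin", "rb") as f:
            self.model = Model.load(f)
        self.ready = True
```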
Jan 27 2022
I think there is still some more work to be done for observability, so maybe leave it for the future.
@elukey -- any objections to this? ^^
Confirming the model repos have all been manually mirrored.
Starting the initial work to deploy the article quality model for Dutch Wikipedia.
It looks like we can do this using tornado.web.HTTPError, adding the status code and message, similar to how it's done here:
https://github.com/kserve/kserve/blob/master/python/kserve/kserve/model.py#L112
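i.e. something like this in the model/transformer code (a sketch; the extractor call and the error type shown are just examples of the revscoring errors we would catch):

```
import tornado.web
from revscoring.errors import RevisionNotFound


def fetch_features(extractor, model, rev_id):
    # Turn a revscoring extraction failure into a proper HTTP error with a
    # status code and message, instead of letting the isvc return a 500.
    try:
        return list(extractor.extract(rev_id, model.features))
    except RevisionNotFound:
        raise tornado.web.HTTPError(
            status_code=400,
            reason="Revision {} not found".format(rev_id),
        )
```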
Jan 26 2022
I can hit a transformer endpoint directly, but I get a 503 error. When I inspect the transformer logs, I see the following:
The cert-manager blocker is gone (see: T298976). I was able to deploy the new transformer successfully to both eqiad and codfw
It seems the editquality-transformer image has not been published yet. I think this is due to the integration/config patchset being merged after the pipeline patchset got merged in the inference services repo. The 'publish' pipeline only gets triggered during post-merge on Jenkins now, so we will need to trigger the pipeline in order to get an image into the WMF docker registry.
Deployment for the new transformer is currently blocked on T298976
Jan 19 2022
@elukey I've been reading the kserve docs (https://kserve.github.io/website/master/modelserving/v1beta1/custom/custom_model/#parallel-inference) and I think we should tune the 'frontend-endpoint' and try the tornado workers first. I'm definitely interested in the ray workers too, but I think that might be a bit more complex?
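Based on that doc, a minimal sketch of what the tornado-worker option might look like (the model class is a stand-in; depending on the kserve version the server class may be KFServer and the worker count may instead be passed via the --workers CLI flag):

```
import kserve


class DummyModel(kserve.Model):
    # Stand-in for one of our isvc model classes.
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = True

    def predict(self, request):
        return {"predictions": []}


if __name__ == "__main__":
    # Spawn multiple tornado worker processes to handle requests in parallel.
    kserve.ModelServer(workers=2).start([DummyModel("enwiki-goodfaith")])
```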
Ok I think all images that need to be updated have been updated. Going to mark this as RESOLVED.