
Prepare 4 ORES English models for Lift Wing
Closed, Resolved · Public

Description

Once Lift Wing is up, I'd like to have four models ready to go. The models with the easiest migration are probably some subset of these:

en.reverted
en.damaging
en.goodfaith
en.articlequality
en.draftquality

The unanswered question I have is whether it is possible to migrate these models out of ORES without retraining them from scratch. I hope so, but I need to dig into how KFServing standalone works before I could say for certain.

Related Objects

Event Timeline

I experimented a bit with packaging a revscoring model as a custom KFServing model a few months back.
Here's the WIP code: https://gist.github.com/accraze/72e41bc2d2e11f3b03c8f795bf3f657e

We still need to figure out how to package dependencies (like the enchant library and dictionary assets such as myspell-en-us and nltk stopwords). My guess is that we'll wind up creating a custom Docker image that includes all of this, although KFServing might already have a way to handle it.

Oh right, I remember really struggling with enchant. Yeah, I might work on a docker image for this, because configuring my environment last time was really brutal.

calbon renamed this task from TKTK migrate 4 ORES english models to Lift wing to Prepare 4 ORES English models for Lift Wing. Feb 1 2021, 10:37 PM

You might want to use the ORES dockerfile as a starting point: https://github.com/wikimedia/ores/blob/master/Dockerfile

IIRC that should cover most of the dependencies to run a model, although you will still need to include the specific language assets / dictionary files, all of which are listed here:
https://github.com/wikimedia/revscoring/blob/master/.travis.yml#L21

The unanswered question I have is whether it is possible to migrate these models out of ORES without retraining them from scratch. I hope so, but I need to dig into how KFServing standalone works before I could say for certain.

@calbon turns out it is possible to migrate the models without retraining them.

Based on the KFServing documentation, in order to deploy our custom revscoring models we'd have to wrap each model in an inference server that can take an input (e.g. a Wikipedia article revision id) and return an output (e.g. a prediction).

The inference server would consist of the following items:

  1. a docker image with revscoring installed,
  2. the model file we want to serve, and
  3. Python code that uses both KFServing and revscoring to take the input sent and return an output.
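As a rough sketch of how those three pieces would fit together (self-contained: `KFModel` below is a stand-in for `kfserving.KFModel`, and all other names are illustrative, not our actual code):

```python
# Sketch of the inference server described in items 1-3. In a real
# deployment the class would inherit kfserving.KFModel and call
# revscoring; both are replaced by stand-ins here so the sketch runs
# on its own. All names are illustrative.

class KFModel:
    """Stand-in for kfserving.KFModel."""
    def __init__(self, name):
        self.name = name
        self.ready = False

class RevscoringModel(KFModel):
    def load(self):
        # Real code would do something like:
        #   self.model = revscoring.Model.load(open("model.bin"))
        self.ready = True

    def predict(self, request):
        rev_id = request["rev_id"]
        # Real code would extract feature values for rev_id and return
        # self.model.score(features); a fixed score stands in here.
        return {"prediction": {"rev_id": rev_id, "score": 0.5}}

model = RevscoringModel("enwiki-damaging")
model.load()
result = model.predict({"rev_id": 123456})
```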

Thanks to @ACraze's guidance, I have created the docker image attached below. It runs revscoring with sample ML code that returns a prediction. It doesn't inherit from KFServing yet, because there is no way to test that part until KFServing is available.

These are the steps I was taking to run the container:

  1. Build the image with a friendly name:version
$ docker build -t revscoring_image:v4 .
  2. View the list of all images to confirm the image was built successfully
$ docker images
  3. Run bash inside the docker container
$ docker run -it --entrypoint=/bin/bash revscoring_image:v4
  4. Run revscoring within the container to confirm it works
# python3 run_model.py

If we are going to build docker images for every model, then we might need to sit down as a team and agree on:

  1. model docker image naming convention
  2. making these images optimal and remove things we might not need (possibly have one optimized base image)
  3. public or private repo for these images

Great work on this so far @kevinbazira

If we are going to build docker images for every model, then we might need to sit as a team and agree on;

  • making these images optimal and remove things we might not need (possibly have one optimized base image)

I think creating an optimized base image for revscoring would be valuable. There are a few things we can trim from the current image; in general we should try to keep it under 1 GB. Also, things like caching our dependency layers can really help speed up the build time.

  • model docker image naming convention

For naming, we can probably just use something like revscoring-<model-type>-<model-name>:$TAG, although I'm open to any other ideas.
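To make the convention concrete, a trivial (hypothetical) helper that builds such a name:

```python
def image_name(model_type: str, model_name: str, tag: str) -> str:
    """Build an image name following the proposed
    revscoring-<model-type>-<model-name>:$TAG convention."""
    return f"revscoring-{model_type}-{model_name}:{tag}"

print(image_name("editquality", "enwiki-damaging", "v1"))
# revscoring-editquality-enwiki-damaging:v1
```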

  • public or private repo for these images

re: repos - not sure how we want to handle this going forward...maybe a monorepo for all revscoring models? It might be overkill to have a separate repo for each. I think for now we will do this on gerrit and then move to gitlab when our instance is ready.

For most of the models (editquality/articlequality/etc.) we might only need to inject the model binary file and the correct language assets into a container built off the base image, so this could be handled easily in our CI/CD pipeline (jenkins for now, then moving to gitlab).

Some thoughts on building a custom image per model.

In KFServing, although we have the capability of serving a custom image per model, it's not best practice to use it. We have 6-7 generic images for serving models from different frameworks (tensorflow-serving, torchserve, sklearn, xgboost, etc.), and each InferenceService spawns that common image. Then a sidecar container called storage-initializer downloads the model file (the serialised weights & biases) in the format of the particular framework (SavedModel, MAR, pickle/joblib, etc.) from the remote storage, and the common image serves it.

The method above gives us less network traffic, less docker image storage on the kubelet, etc., as the common image is cached. Especially under high workloads where scaling is involved, it is preferable to offload your kubelets from constantly downloading all the images of your different models. In the case of sklearn, for example, the sklearnserver image is ~180MB and the pickle file can be a few MBs. This is a fundamental component of the InferenceService that we call the predictor, explained nicely in @ACraze's presentation last week.

We also have the concept of the transformer: a set of pre/post-processing tasks which run in a separate container and do things not directly related to the inference, such as image resizing for CV models or tokenization for NLP models. Separating the transformer from the predictor lets us manage scaling properly and keep a separation of duties between components. The communication across these two services/pods/containers is handled automatically by istio/knative, and managing it is the main role of the kfserving-controller.

I've built @kevinbazira's revscoring_image:v4 and explored the revscoring library. The extractor fetches the revision and produces the features that are then scored by the gradient boosting model. These look like two separate tasks: a transformer which fetches the revision and produces the features, and a predictor which receives the features from the transformer and scores them. So my advice is to use two images: a transformer for the extractor and a predictor for the scoring.
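A minimal sketch of that split, with plain functions standing in for the two containers (all names are illustrative; real code would use revscoring's Extractor in the transformer and the model's scoring call in the predictor):

```python
# Sketch of the proposed transformer/predictor split. In KFServing these
# would be two separate containers wired together by the platform; here
# they are two plain functions with placeholder logic.

def transformer(rev_id: int) -> dict:
    """Transformer: fetch the revision and produce feature values.
    Real code would use revscoring's Extractor here."""
    features = [0.1, 0.2, 0.3]  # placeholder for extracted feature values
    return {"rev_id": rev_id, "features": features}

def predictor(payload: dict) -> dict:
    """Predictor: score pre-extracted features.
    Real code would pass them through the model's trees."""
    score = sum(payload["features"])  # placeholder for the model's score
    return {"rev_id": payload["rev_id"], "score": score}

# KFServing routes requests through the transformer into the predictor;
# locally the flow is just:
result = predictor(transformer(123456))
```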

The transformer can perform the heavy-duty task of fetching the article and generating the features. It's also a fat image, as it contains the nltk data required for that task and a lot of the libraries you're trying to optimise/clean up. This task can be engineered further in the future; for example, you might want these features stored and served by Feast so you can reuse them when scoring the other models. This way you'll improve the overall latency, as you'll essentially cache the feature engineering task instead of recalculating it for each model.

The predictor can focus on taking the feature values it got from the transformer and passing them through the trees of the model file, and that doesn't require all the libraries the transformer has. It's going to be a lightweight container that does this one scoring task, using sklearn's gradient boosting classifier.

Thank you @ACraze and @Theofpa for your suggestions.

@Theofpa I will definitely try using two images, one as a transformer and the other as a predictor, as soon as our KFServing environment is up and running. Will keep you posted on this.

@ACraze as I was looking for things to trim from the current image, I checked for the heaviest layers using the steps below:

  1. Get the image id.
$ docker images -a
  2. List the image layers with their sizes.
$ docker history --no-trunc <image id>

My results were:

IMAGE          CREATED       CREATED BY (truncated for readability)                                SIZE
760d1f3b548c   6 days ago    CMD ["python3" "run_model.py"]                                        0B
6d6b5eda228f   6 days ago    pip3 install --no-cache-dir -r ./requirements.txt && python3 -m nltk.downloader omw sentiwordnet stopwords wordnet   865MB
008739702904   6 days ago    COPY multi:f1de739b1ccf... in ./                                      10.4MB
68c887a99638   6 days ago    apt-get update && apt-get upgrade -y && apt-get install -y python3-pip python3-dev g++ gfortran liblapack-dev libopenblas-dev libenchant1c2a (+ aspell-*, myspell-*, hunspell-* dictionary packages and voikko-fi)   795MB
8cd0f792fe0d   6 days ago    ARG DEBIAN_FRONTEND=noninteractive                                    0B
2ace0ceab9fe   13 days ago   WORKDIR /app                                                          0B
2fb943f0ebb8   13 days ago   ENV APP_HOME=/app                                                     0B
f63181f19b2f   5 weeks ago   CMD ["/bin/bash"]                                                     0B
<missing>      5 weeks ago   mkdir -p /run/systemd && echo 'docker' > /run/systemd/container       7B
<missing>      5 weeks ago   [ -z "$(apt-get indextargets)" ]                                      0B
<missing>      5 weeks ago   set -xe && ... (base image apt/dpkg cache-cleanup configuration)      811B
<missing>      5 weeks ago   ADD file:2a90223d9f00d31e... in /                                     72.9MB

And it looks like the 2 layers that install the NLTK data and the *spell language dictionaries are the heaviest.

... pip3 install --no-cache-dir -r ./requirements.txt && python3 -m nltk.downloader ... 865MB   
... apt-get update   && apt-get upgrade -y   && apt-get install -y   python3-pip   ...  795MB

These seem like dependencies we can't get rid of. Is there a way you think we could cut down the size of these 2 layers? Maybe cache them somewhere?
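As a side note, ranking layers by size can be scripted; a small sketch (hypothetical helper), assuming the output of `docker history --format '{{.Size}}\t{{.CreatedBy}}' <image id>`:

```python
# Sketch: rank image layers by size from `docker history` output.
# Assumes lines of the form "<size>\t<command>", as produced by:
#   docker history --format '{{.Size}}\t{{.CreatedBy}}' <image id>

UNITS = {"B": 1, "kB": 1_000, "MB": 1_000_000, "GB": 1_000_000_000}

def parse_size(size: str) -> int:
    """Convert a human-readable docker size like '865MB' to bytes."""
    # Check longer suffixes first so 'MB' is matched before 'B'.
    for unit, factor in sorted(UNITS.items(), key=lambda kv: -len(kv[0])):
        if size.endswith(unit):
            return int(float(size[: -len(unit)]) * factor)
    raise ValueError(f"unrecognized size: {size}")

def heaviest_layers(history: str, top: int = 2) -> list:
    """Return the `top` largest (size, command) rows."""
    rows = [line.split("\t", 1) for line in history.strip().splitlines()]
    rows.sort(key=lambda row: parse_size(row[0]), reverse=True)
    return rows[:top]

sample = (
    "0B\tCMD [...]\n"
    "865MB\tpip3 install ... nltk.downloader ...\n"
    "795MB\tapt-get install ...\n"
    "10.4MB\tCOPY ..."
)
print(heaviest_layers(sample))
```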

There is also a possibility that Kubeflow has a way to handle this issue that I am not yet aware of. We will be able to establish this when our environment is up.

@kevinbazira we discussed this a bit at today's team meeting; I wonder if it is possible to load the model file into sklearn and get similar performance?

We could use the sklearn-server that comes with KFServing, which is much smaller. We may still need to use revscoring.api.Extractor to get the feature values using a transformer, however the model itself could then remain small. I'm not sure how much this will help, but it still might be worth looking into.

There are ways to reduce its size:

  • Add && rm -rf /var/lib/apt/lists/* after installing python3-pip; this is a common pattern for reducing the size of a docker image.
  • Or better, use multi-stage docker builds.
  • You don't need to install everything if the model doesn't need it, e.g. the aspell or hunspell dictionaries for languages other than English.

Also these large files are not used:

/usr/local/lib/python3.8/dist-packages/gensim \
/usr/local/lib/python3.8/dist-packages/lxml \
/usr/lib/gcc/x86_64-linux-gnu/9/cc1plus \
/usr/lib/gcc/x86_64-linux-gnu/9/f951 \
/usr/lib/gcc/x86_64-linux-gnu/9/lto1 \
/usr/lib/gcc/x86_64-linux-gnu/9/cc1 \
/usr/lib/x86_64-linux-gnu/libicudata.so.66.1 \
/usr/lib/x86_64-linux-gnu/liblapack_pic.a \
/usr/lib/x86_64-linux-gnu/lapack/liblapack.a