Page MenuHomePhabricator

Naming convention for the model storage structure
Closed, ResolvedPublic

Description

We need a naming convention for the directory structure of the model-storage, a formality which can help us stay productive by being consistent. More reasoning here.

In T272874 @kevinbazira and @ACraze highlighted that we need a naming convention for the model docker-images, the recommendation is to go with:

revscoring-<model-type>-<model-name>:$TAG

Other than the docker-images and the model-storage, we will have to also discuss the naming convention for the inference-service-url and how it's connected to ORES API which will have a direct impact on the model and service metrics in grafana and other reports, as well as the logging of the payload. Consistency across these three resource-types will help us avoid maintaining configuration mappings, a nightmare for operations.

In T275709 we have a nice list of 110 ORES models which already have the following categories:

  • wiki
    • enwiki
    • nlwiki
    • wikidata
    • ...
  • model classes:
    • editquality
      • damaging
      • goodfaith
      • reverted
    • articlequality
      • draftquality
      • wp10
      • page_level
      • item_quality
    • topic-routing
      • drafttopic
      • articletopic
  • model type:
    • gradient boosting
    • random forest

Event Timeline

We chose to use S3 object storage over Swift, which comes with:

  • Bucket names can have only lowercase letters, number, dots and hyphens. Typically names are in kebab-case, see AWS S3 bucket naming rules.
  • Object key is the path to a file, a combination of the dir/subdir names and file name. It has more freedom on the choice of special characters (see AWS Creating object key names), although we can follow the bucket naming rules for simplicity.
  • Object key can have tags. This can be useful as meta information, but it can't be used when accessing the objects as the object name has to be unique.

Some random opinionated, personal thoughts on best practices and vision:

  • The metadata (hyperparameters, architecture and model-type) of a model SHOULD NOT be recorded in the object key.
    • Consecutive versions of a model can have different model type. This is becoming more popular with NAS and AutoML. For example the enwiki-editquality-damaging model version 1 might use gradient boosting and version 2 might use random forest.
    • If we have a common name across the stored model and the serving url, we can avoid having to configure the clients/applications when we're updating a model.
    • The recording of such metadata information is the role of MLMD, sacred or mlflow tracking, in order of personal preference. We might introduce Kubeflow pipelines in the future, so thinking about it already makes our decision future proof.
  • The version of the model SHOULD be recorded in the object key.
    • A deployment strategy to have canary rollouts for an A/B testing scenario across different model versions will be handled by the underlying platform capability. In KFServing that is handled by defining different storageUri for each target. A workaround for that would be to have the model version in a subdirectory. The ORES API though has an overlapping role in this deployment strategy. KFServing can also do multi-model serving and used for ensemble.
  • The consumer of the model COULD be indicated in the object key.
    • The model used by a particular wiki might be used by another wiki as well. For example model editquality-damaging can be used by enwiki and wikidata.
    • Although the natural language used for each model COULD be indicated in the object key, as existing ORES models are language specific.
      • Some models might be language agnostic and not apply in this guidance, for example MachineVision models.

With the above thoughts, the proposed naming convention for the object key can be:

<model-class>/<optional-language>/<version>

Assuming the bucket will have a name like s3://ml-models/, a sample structure would look like this:

s3://ml-models
├── article-topic
├── damaging
│   ├── enwiki
│   │   ├── 202103021002
│   │   └── 202104181735
│   └── nlwiki
│       ├── 202011130925
│       └── 202012141830
├── draft-quality
├── draft-topic
├── good-faith
├── item-quality
├── nsfw-classifier
│   ├── 202011130925
│   └── 202012141830
├── page-level
├── reverted
└── wp10

And an example inference-service spec with custom image and hardcoded version would look like that:

apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "edit-quality-damaging-enwiki-202104181735"
spec:
  predictor:
    custom:
      image: wmf-registry/damaging/
      storageUri: "s3://ml-models/damaging/enwiki/202104181735"

Although a Tensorflow CV model (such as T264052) could have this spec given that tensorflow model server detects versions in subdirectories of the storageUri:

apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "nsfw-classifier"
spec:
  predictor:
    tensorflow:
      storageUri: "s3://ml-models/nsfw-classifier/"

@Theofpa thanks a lot for all the inputs, looks great. I have a doubt about the dir/subdir paths though, since IIUC Swift doesn't allow nested bucker/containers but only a flat structure. I see that AWS S3 has the concept of folder, and Openswift seems to offer something similar, but I am wondering pros/cons about the approach of using subdirs in a flat environment (not saying I am against it, just let's discuss the options :)

Thanks @Theofpa , this is really helpful right now as I'm working with a sandbox KF install to test some of our models with.

Object key can have tags. This can be useful as meta information, but it can't be used when accessing the objects as the object name has to be unique.

I noticed you are using a timestamp for the version, is this to ensure a unique key? We currently use semver with ORES but I could see how that could lead to naming collisions.

The metadata (hyperparameters, architecture and model-type) of a model SHOULD NOT be recorded in the object key

agreed, having metadata in the key would lead to ridiculously long names that would be frustrating to deal with, while also not adding much value

I am wondering pros/cons about the approach of using subdirs in a flat environment (not saying I am against it, just let's discuss the options :)

@elukey I think the idea of using subdirs is useful. In the past, it has been helpful to have tiered model classes (editquality/articlequality/etc.) for many reasons, like helping newcomers & non-experts make sense of what is available. Also some of these models will be retrained more than others and will potentially have many more versions available, so I think it makes sense to have them organized a bit. As Theo mentioned, it would be helpful in handling canary rollouts, as we could simply just point to another model within the subdir.

I noticed you are using a timestamp for the version, is this to ensure a unique key? We currently use semver with ORES but I could see how that could lead to naming collisions.

Oh yes, I'd say this is a best practice I recommend. It comes from tensorflow-model-server that reads all subdirectories in the specified model directory and automatically picks up the largest value as the current model. So typically you'd have an increment number as model version, although I find the timestamp more convenient (and if it's in the YYYYMMDDHHMM format it will be incremental). Here's an example:

(base) theofpa@z420:~/models$ tree
.
└── mycoolmodel
    └── 202104232345
        ├── assets
        │   ├── vocab_compute_and_apply_vocabulary_1_vocabulary
        │   └── vocab_compute_and_apply_vocabulary_vocabulary
        ├── saved_model.pb
        └── variables
            ├── variables.data-00000-of-00001
            └── variables.index

4 directories, 5 files
(base) theofpa@z420:~$ docker run -v /home/theofpa/models/mycoolmodel:/model tensorflow/serving --model_base_path=/model --name mycoolmodel
...
2021-04-23 21:48:37.086351: I tensorflow_serving/model_servers/server_core.cc:587]  (Re-)adding model: model
2021-04-23 21:48:37.187269: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: model version: 202104232345}
2021-04-23 21:48:37.187344: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: model version: 202104232345}
2021-04-23 21:48:37.187372: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: model version: 202104232345}
2021-04-23 21:48:37.187470: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading SavedModel from: /model/202104232345
2021-04-23 21:48:37.202740: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:55] Reading meta graph with tags { serve }
2021-04-23 21:48:37.202773: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:93] Reading SavedModel debug info (if present) from: /model/202104232345
2021-04-23 21:48:37.288550: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2021-04-23 21:48:37.300976: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3192320000 Hz
2021-04-23 21:48:37.330048: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /model/202104232345
2021-04-23 21:48:37.350279: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 162810 microseconds.
2021-04-23 21:48:37.353126: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /model/202104232345/assets.extra/tf_serving_warmup_requests
2021-04-23 21:48:37.354463: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: model version: 202104232345}
2021-04-23 21:48:37.357867: I tensorflow_serving/model_servers/server.cc:371] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2021-04-23 21:48:37.360079: I tensorflow_serving/model_servers/server.cc:391] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...

create a "new" model version:

(base) theofpa@z420:~/models$ cp -fr mycoolmodel/202104232345 mycoolmodel/202104232348
(base) theofpa@z420:~/models$ tree
.
└── mycoolmodel
    ├── 202104232345
    │   ├── assets
    │   │   ├── vocab_compute_and_apply_vocabulary_1_vocabulary
    │   │   └── vocab_compute_and_apply_vocabulary_vocabulary
    │   ├── saved_model.pb
    │   └── variables
    │       ├── variables.data-00000-of-00001
    │       └── variables.index
    └── 202104232348
        ├── assets
        │   ├── vocab_compute_and_apply_vocabulary_1_vocabulary
        │   └── vocab_compute_and_apply_vocabulary_vocabulary
        ├── saved_model.pb
        └── variables
            ├── variables.data-00000-of-00001
            └── variables.index

7 directories, 10 files

tensorflow/serving detects the new model version and promotes it after validation:

2021-04-23 21:49:00.112316: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: model version: 202104232348}
2021-04-23 21:49:00.112396: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: model version: 202104232348}
2021-04-23 21:49:00.112421: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: model version: 202104232348}
2021-04-23 21:49:00.112508: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading SavedModel from: /model/202104232348
2021-04-23 21:49:00.127364: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:55] Reading meta graph with tags { serve }
2021-04-23 21:49:00.127403: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:93] Reading SavedModel debug info (if present) from: /model/202104232348
2021-04-23 21:49:00.188418: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2021-04-23 21:49:00.222246: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /model/202104232348
2021-04-23 21:49:00.241849: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 129344 microseconds.
2021-04-23 21:49:00.246679: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /model/202104232348/assets.extra/tf_serving_warmup_requests
2021-04-23 21:49:00.248762: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: model version: 202104232348}
2021-04-23 21:49:00.248855: I tensorflow_serving/core/loader_harness.cc:138] Quiescing servable version {name: model version: 202104232345}
2021-04-23 21:49:00.248901: I tensorflow_serving/core/loader_harness.cc:145] Done quiescing servable version {name: model version: 202104232345}
2021-04-23 21:49:00.248919: I tensorflow_serving/core/loader_harness.cc:120] Unloading servable version {name: model version: 202104232345}
2021-04-23 21:49:00.258960: I ./tensorflow_serving/core/simple_loader.h:364] Calling MallocExtension_ReleaseToSystem() after servable unload with 693061
2021-04-23 21:49:00.259008: I tensorflow_serving/core/loader_harness.cc:128] Done unloading servable version {name: model version: 202104232345}

In our case we can't make use of this deployment functionality as the storage-initializer does not monitor the s3 bucket to sync such changes, but the best practice of using timestamp for the directory of the model version is useful.

And that shouldn't overlap with the semver policy if you want to respect it, it's just the storage location.

On the other hand (opinionated answer follows), the model artifact is not a software artifact to use semver. In a future where we might have a continuous training pipeline which automatically deploys the new versions of a model many times a day, having the timestamp format for the model artifact version would make things easier.

Oh yes, I'd say this is a best practice I recommend. It comes from tensorflow-model-server that reads all subdirectories in the specified model directory and automatically picks up the largest value as the current model. So typically you'd have an increment number as model version, although I find the timestamp more convenient (and if it's in the YYYYMMDDHHMM format it will be incremental). Here's an example:

One question about timestamps-as-versions is what granularity is useful. In my previous experiences (e.g. using the date, but not time of day for DNS zone serials), making the wrong decisions there early on can lead to a lot of headache later.

tl;dr: do you think minute-granularity is enough, or should we also include seconds? I will admit that I have little experience with the typical "deployment pace" of ML models.

@klausman I think most of our models should be ok with minute-granularity. The only case I can think of where it could be problematic is if we train two slightly different versions of the same model in parallel (and they finish within ~60 secs of each other), but I don't see that happening anytime soon. We could also just limit the pipelines to only running one at a time to make sure we avoid it.

Echoing @klausman and @calbon from our ML Team Technical Meeting today:

Let's go with seconds-granularity for now when creating a timestamp version. There could be a case where we are pushing multiple models in bulk and it would be easier to go with this resolution now rather than attempting to do it later down the road.

I'm currently refactoring my model_upload script to reflect this and will create a patch for the inference-services repo this week.

Change 719668 had a related patch set uploaded (by Accraze; author: Accraze):

[machinelearning/liftwing/inference-services@main] utils: added model upload script

https://gerrit.wikimedia.org/r/719668

Change 736848 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::statistics::explorer: add ml profile

https://gerrit.wikimedia.org/r/736848

Change 736848 merged by Elukey:

[operations/puppet@production] role::statistics::explorer: add ml profile

https://gerrit.wikimedia.org/r/736848

Change 719668 abandoned by Accraze:

[machinelearning/liftwing/inference-services@main] utils: added model upload script

Reason:

Moved model upload script to operations/puppet/+/736848

https://gerrit.wikimedia.org/r/719668

ACraze claimed this task.

@elukey - I think we are all good on this task. Major thanks to @Theofpa for helping us out on this and suggesting a sane naming structure. Going to mark as RESOLVED, feel free to re-open if anyone feels we need to discuss further.