
Use Joblib for ORES model serialization
Closed, Declined · Public

Description

We're currently using the vanilla pickle implementation, for no particular reason other than simplicity. Some good alternatives are available; we should evaluate them for disk, memory, and startup-time savings.

This patch implements joblib serialization, which has optimized support for large numpy arrays and inline compression: https://github.com/wiki-ai/revscoring/pull/408
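For reference, the joblib API in question is a drop-in pair of `dump`/`load` calls; the actual change lives in the linked revscoring PR. A minimal sketch (the dict-of-arrays "model" and filename are stand-ins, not revscoring code):

```python
import joblib
import numpy as np

# Stand-in for a trained model: an object holding large numpy arrays,
# which is the case joblib's serialization format is optimized for.
model = {"weights": np.arange(100_000, dtype=np.float64)}

# compress=3 enables zlib compression at level 3, joblib's suggested
# default trade-off between file size and (de)serialization speed.
joblib.dump(model, "model.joblib", compress=3)

restored = joblib.load("model.joblib")
assert np.array_equal(restored["weights"], model["weights"])
```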


Event Timeline

Size change with joblib and its default zlib compression:

121M	submodules/articlequality/models
55M	submodules/articlequality/models_new

13M	submodules/draftquality/models
19M	submodules/draftquality/models_new

31M	submodules/drafttopic/models
15M	submodules/drafttopic/models_new

365M	submodules/editquality/models
153M	submodules/editquality/models_new

Runtime profile:

pickle:

 2729 awight    20   0 2533940 1.770g  31676 S   0.3  5.0   0:27.36 python                                                       
 2740 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       
 2741 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       
 2745 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       
 2746 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       
 2747 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       
 2748 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.01 python                                                       
 2749 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       

pickle loaded via joblib:

 1075 awight    20   0 2542404 1.779g  31900 S   0.3  5.0   0:57.83 python
 1111 awight    20   0 2542148 1.753g   4844 S   0.0  5.0   0:00.01 python
 1112 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.00 python
 1114 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.01 python
 1121 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.00 python
 1122 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.01 python
 1123 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.00 python
 1124 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.01 python

joblib loading joblib-serialized:

 1712 awight    20   0 2550192 1.786g  31588 S   0.3  5.1   0:47.77 python
 1726 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.01 python
 1727 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.00 python
 1728 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.00 python
 1729 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.01 python
 1730 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.00 python
 1731 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.01 python
 1732 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.01 python

So, we save disk space with compression (in every repo except draftquality, which actually grew from 13M to 19M), but startup takes about 2x longer. Another sweetener is that joblib deserialization appears to be backwards-compatible with existing pickle model files, which makes the migration simple. This isn't much to write home about overall, although the disk savings are nice.
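The backwards compatibility noted above follows from joblib's loader being built on the standard unpickler, so `joblib.load()` can read a file written with plain `pickle.dump()` directly. A minimal sketch of that migration property (filenames are illustrative):

```python
import pickle

import joblib
import numpy as np

legacy_model = {"weights": np.ones(10)}

# Write with the stdlib pickle, as the current ORES model files are.
with open("legacy_model.pkl", "wb") as f:
    pickle.dump(legacy_model, f)

# joblib.load reads the plain-pickle file as-is, so existing models
# need no one-off conversion step before switching serializers.
restored = joblib.load("legacy_model.pkl")
assert np.array_equal(restored["weights"], legacy_model["weights"])
```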

My recommendation is that we go ahead with compressed joblib serialization just to get the disk savings and transparent choice of compression algorithm.
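The "transparent choice of compression algorithm" refers to joblib's `compress` parameter, which accepts either a zlib level or a `(method, level)` tuple; the joblib docs list 'zlib', 'gzip', 'bz2', 'lzma', and 'xz' as supported methods. The reader auto-detects the format, so we could change codecs later without touching any load code:

```python
import joblib
import numpy as np

model = {"weights": np.zeros(100_000)}

# Same dump call, different codecs; only the compress argument changes.
joblib.dump(model, "model.zlib.joblib", compress=("zlib", 3))
joblib.dump(model, "model.lzma.joblib", compress=("lzma", 3))

# joblib.load detects the compression from the file itself.
for path in ("model.zlib.joblib", "model.lzma.joblib"):
    restored = joblib.load(path)
    assert np.array_equal(restored["weights"], model["weights"])
```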

Just found a great resource for further reading: https://www.andrey-melentyev.com/model-interoperability.html

Here's a very similar project, offering models as REST microservices: http://clipper.ai

Halfak renamed this task from Explore alternative model serializations to Use Joblib for ORES model serialization.Apr 9 2019, 9:19 PM