
Use Joblib for ORES model serialization
Closed, Declined · Public

Description

We're currently using the vanilla pickle implementation, for no particular reason other than simplicity. Some good alternatives are available; we should evaluate them for disk, memory, and startup-time savings.

This patch implements joblib serialization, which has optimized support for large numpy arrays and inline compression: https://github.com/wiki-ai/revscoring/pull/408
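For reference, the joblib API in question is a drop-in pair of `dump`/`load` calls; the actual change lives in the linked revscoring PR. A minimal sketch (the dict-of-arrays "model" and filename are stand-ins, not revscoring code):

```python
import joblib
import numpy as np

# Stand-in for a trained model: an object holding large numpy arrays,
# which is the case joblib's serialization format is optimized for.
model = {"weights": np.arange(100_000, dtype=np.float64)}

# compress=3 enables zlib compression at level 3, joblib's suggested
# default trade-off between file size and (de)serialization speed.
joblib.dump(model, "model.joblib", compress=3)

restored = joblib.load("model.joblib")
assert np.array_equal(restored["weights"], model["weights"])
```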


Event Timeline

Size change with joblib and its default zlib compression:

121M	submodules/articlequality/models
55M	submodules/articlequality/models_new

13M	submodules/draftquality/models
19M	submodules/draftquality/models_new

31M	submodules/drafttopic/models
15M	submodules/drafttopic/models_new

365M	submodules/editquality/models
153M	submodules/editquality/models_new

Runtime profile:

pickle:

 2729 awight    20   0 2533940 1.770g  31676 S   0.3  5.0   0:27.36 python                                                       
 2740 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       
 2741 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       
 2745 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       
 2746 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       
 2747 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       
 2748 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.01 python                                                       
 2749 awight    20   0 2533536 1.744g   4600 S   0.0  4.9   0:00.00 python                                                       

pickle loaded via joblib:

 1075 awight    20   0 2542404 1.779g  31900 S   0.3  5.0   0:57.83 python
 1111 awight    20   0 2542148 1.753g   4844 S   0.0  5.0   0:00.01 python
 1112 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.00 python
 1114 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.01 python
 1121 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.00 python
 1122 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.01 python
 1123 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.00 python
 1124 awight    20   0 2542148 1.753g   4652 S   0.0  5.0   0:00.01 python

joblib loading joblib-serialized:

 1712 awight    20   0 2550192 1.786g  31588 S   0.3  5.1   0:47.77 python
 1726 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.01 python
 1727 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.00 python
 1728 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.00 python
 1729 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.01 python
 1730 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.00 python
 1731 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.01 python
 1732 awight    20   0 2549936 1.761g   4796 S   0.0  5.0   0:00.01 python

So, we save disk space with compression (in every repo except draftquality, which actually grew from 13M to 19M), but startup takes about 2x longer. Another sweetener is that joblib deserialization appears to be backwards-compatible with existing pickle model files, which makes the migration simple. This isn't much to write home about overall, although the disk savings are nice.
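The backwards compatibility noted above follows from joblib's loader being built on the standard unpickler, so `joblib.load()` can read a file written with plain `pickle.dump()` directly. A minimal sketch of that migration property (filenames are illustrative):

```python
import pickle

import joblib
import numpy as np

legacy_model = {"weights": np.ones(10)}

# Write with the stdlib pickle, as the current ORES model files are.
with open("legacy_model.pkl", "wb") as f:
    pickle.dump(legacy_model, f)

# joblib.load reads the plain-pickle file as-is, so existing models
# need no one-off conversion step before switching serializers.
restored = joblib.load("legacy_model.pkl")
assert np.array_equal(restored["weights"], legacy_model["weights"])
```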

My recommendation is that we go ahead with compressed joblib serialization just to get the disk savings and transparent choice of compression algorithm.
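The "transparent choice of compression algorithm" refers to joblib's `compress` parameter, which accepts either a zlib level or a `(method, level)` tuple; the joblib docs list 'zlib', 'gzip', 'bz2', 'lzma', and 'xz' as supported methods. The reader auto-detects the format, so we could change codecs later without touching any load code:

```python
import joblib
import numpy as np

model = {"weights": np.zeros(100_000)}

# Same dump call, different codecs; only the compress argument changes.
joblib.dump(model, "model.zlib.joblib", compress=("zlib", 3))
joblib.dump(model, "model.lzma.joblib", compress=("lzma", 3))

# joblib.load detects the compression from the file itself.
for path in ("model.zlib.joblib", "model.lzma.joblib"):
    restored = joblib.load(path)
    assert np.array_equal(restored["weights"], model["weights"])
```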

Just found a great resource for further reading: https://www.andrey-melentyev.com/model-interoperability.html

Here's a very similar project, offering models as REST microservices: http://clipper.ai

Halfak renamed this task from Explore alternative model serializations to Use Joblib for ORES model serialization.Apr 9 2019, 9:19 PM