The word2vec data is a 1.5GB file that we'll need to deploy to all ORES compute nodes. Our expected timeline for files of this size is:
- We want the word2vec data ASAP, since it's blocking a production model.
- We might never need to update that file, but it will probably need to stay installed for years to come.
- I expect that we'll have our own "embeddings" files of similar size within a year.
Options for deployment:
- Deploy as a .deb.
  - A patch is prepared at https://phabricator.wikimedia.org/source/word2vec/
  - @akosiaris has recommended against this in T187217#4005891, because it would slow down or break machine provisioning.
- Deploy in the ORES git repo.
  - Absolutely not, this would make our repo unusable.
- Deploy as its own git-lfs repo.
  - Scap might be ready to handle git-lfs.
  - @awight is not looking forward to being the rocket dog, i.e. the first adopter of git-lfs in scap. We don't want to block deployment waiting on this kind of infrastructure work if it's not mature.