[Epic] Support word2vec for production ORES models
Closed, Resolved, Public

Description

Tracks the requirements that must be addressed before models using word2vec can be deployed on ORES.

Event Timeline

awight renamed this task from Support for word2vec on ORES deployment to Support word2vec for production ORES models. Feb 26 2018, 3:56 PM
awight updated the task description.
awight updated the task description.

Working on the Debian packaging here: https://phabricator.wikimedia.org/source/word2vec/

@Sumit Is the gensim package able to read the gzipped file, or should we decompress during installation?

Here's what the package will look like installed, for now:

dpkg -L word2vec
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/word2vec
/usr/share/doc/word2vec/changelog.Debian.gz
/usr/share/doc/word2vec/copyright
/usr/share/word2vec
/usr/share/word2vec/GoogleNews-vectors-negative300.bin.gz
awight added a subscriber: akosiaris.

@akosiaris Feel like reviewing this debian packaging for sanity?

Why do we even need to create a debian package to ship that single file? I don't think it's worth the trouble. Can't we just ship it in the ORES repos? How big is it, btw? https://phabricator.wikimedia.org/source/word2vec/browse/master/GoogleNews-vectors-negative300.bin.gz times out and I can't download it.

@awight gensim can read the gzipped file directly.
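
To illustrate why no decompression step is needed at install time, here is a minimal sketch that streams vectors straight out of a gzipped word2vec binary file using only the Python standard library. It assumes the usual word2vec binary layout (an ASCII header line with vocabulary size and vector size, then each word followed by a space and little-endian float32 values). The helper names `read_word2vec_gz` and `write_tiny_sample` are hypothetical and exist only for this example. With gensim itself the equivalent call would typically be `KeyedVectors.load_word2vec_format(path, binary=True)`, which, as far as I know, decompresses `.gz` files transparently based on the extension.

```python
import gzip
import struct


def read_word2vec_gz(path, limit=None):
    """Yield (word, vector) pairs from a gzipped word2vec binary file."""
    with gzip.open(path, "rb") as fh:
        # Header line: "<vocab_size> <vector_size>\n" in ASCII.
        vocab_size, vector_size = (int(x) for x in fh.readline().split())
        count = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(count):
            # Each entry is the word, one space, then vector_size float32s.
            word = bytearray()
            while True:
                ch = fh.read(1)
                if ch == b"":
                    raise EOFError("truncated word2vec file")
                if ch == b" ":
                    break
                if ch != b"\n":  # some writers emit a newline after each vector
                    word.extend(ch)
            raw = fh.read(4 * vector_size)
            if len(raw) != 4 * vector_size:
                raise EOFError("truncated vector data")
            yield word.decode("utf-8"), struct.unpack("<%df" % vector_size, raw)


def write_tiny_sample(path):
    """Write a two-word sample file in the same layout (for demonstration only)."""
    entries = [("king", (0.5, -1.0, 2.0)), ("queen", (0.25, 4.0, -0.5))]
    with gzip.open(path, "wb") as fh:
        fh.write(b"%d %d\n" % (len(entries), 3))
        for word, vec in entries:
            fh.write(word.encode("utf-8") + b" ")
            fh.write(struct.pack("<3f", *vec))
```

Pointing `read_word2vec_gz` at `/usr/share/word2vec/GoogleNews-vectors-negative300.bin.gz` with a small `limit` would be a cheap way to sanity-check the installed file without loading the whole model into memory.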

@akosiaris the file is about 1.5 GB. We decided against storing it in the ores git repo because it would bloat the repo, and every new version of the file would add to that size again.

Yes, given the size of that file, not storing it directly in the ores git repo was a wise decision; doing so would have been a disaster. Shipping it as a debian package is not really better, though, for several reasons: it would cause puppet issues during installation (slow fetching, network congestion), it would bloat both the local filesystems and the debian repository filesystem, and the debianization effort is really not worth it for a single file.

How about using git-fat? IIRC scap already supports that.
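
For readers unfamiliar with the approach, the general idea behind large-file helpers of this kind can be sketched as follows: the repository tracks only a small placeholder naming the content by hash, while the real bytes live in a separate store. This is an illustration of the concept only, not git-fat's implementation or on-disk format. The function names and the stub layout are hypothetical.

```python
import hashlib
import os
import shutil


def stash_large_file(path, store_dir):
    """Copy a large file into store_dir under its SHA-1 and return a small stub."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as fh:
        # Hash in 1 MiB chunks so a 1.5 GB file never has to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha1.update(chunk)
    digest = sha1.hexdigest()
    size = os.path.getsize(path)
    os.makedirs(store_dir, exist_ok=True)
    shutil.copyfile(path, os.path.join(store_dir, digest))
    # Hypothetical stub format: this short line is all the repository would track.
    return "large-file-stub %s %d\n" % (digest, size)


def restore_large_file(stub, store_dir, dest):
    """Recreate the original file at dest from the stub and the external store."""
    _, digest, size = stub.split()
    shutil.copyfile(os.path.join(store_dir, digest), dest)
    if os.path.getsize(dest) != int(size):
        raise ValueError("restored file has unexpected size")
```

The stub stays a few dozen bytes no matter how large the model file is, so new versions of the file add almost nothing to the repository history.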

Note that I've created T188446: Package word2vec binaries as a more specific sub-task regarding the packaging bits discussed above.

awight renamed this task from Support word2vec for production ORES models to [Epic] Support word2vec for production ORES models. Mar 5 2018, 3:47 PM
awight removed awight as the assignee of this task.
awight added a project: Epic.
awight moved this task from Review to Non-Epic on the Machine-Learning-Team (Active Tasks) board.