Task to track the requirements that need to be addressed before we can have models using word2vec on ORES deployed.
Description
Status | Assigned | Task
---|---|---
Resolved | awight | T187217 [Epic] Support word2vec for production ORES models
Resolved | Sumit | T188445 Implement word2vec featurevector in revscoring
Resolved | awight | T188446 Package word2vec binaries
Resolved | awight | T181678 Plan migration of ORES repos to git-lfs
Resolved | demon | T181835 Add gitlab to proxies/whitelist for mirroring to phabricator
Resolved | mmodell | T180627 Support git-lfs in scap
Resolved | awight | T180628 Install git-lfs client (at least on scap targets & masters)
Declined | None | T182085 Connect Phabricator to swift for storage of git-lfs and file uploads.
Resolved | mmodell | T192042 Create gerrit mirrors for all github-based ORES repos
Resolved | fgiunchedi | T192124 Deploy Scap 3.8.0 to production
Resolved | Halfak | T188447 Update ORES wheels for new revscoring requirements
Resolved | Halfak | T188755 Update ORES requirements to support revscoring 2.2.0
Resolved | Halfak | T188775 Re-train models with revscoring 2.2.0
Event Timeline
Working on the Debian packaging here: https://phabricator.wikimedia.org/source/word2vec/
@Sumit Is the gensim package able to read the gzipped file, or should we decompress during installation?
Here's what the package will look like installed, for now:
dpkg -L word2vec
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/word2vec
/usr/share/doc/word2vec/changelog.Debian.gz
/usr/share/doc/word2vec/copyright
/usr/share/word2vec
/usr/share/word2vec/GoogleNews-vectors-negative300.bin.gz
Why do we even need to create a Debian package to ship that single file? I don't think it's worth the trouble. Can't we just ship it in the ORES repos? How big is it, btw? https://phabricator.wikimedia.org/source/word2vec/browse/master/GoogleNews-vectors-negative300.bin.gz times out and I can't download it.
@awight gensim would work with the gzipped file.
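For anyone who wants to sanity-check the gzipped file without decompressing it at install time, here is a minimal stdlib-only sketch. It assumes the standard word2vec binary layout (an ASCII header line `<vocab_size> <vector_size>`, then for each entry the word, a space, and `vector_size` float32 values) and little-endian floats. The function names are hypothetical and this is not how gensim reads the file internally.

```python
import gzip
import struct


def read_word2vec_header(path):
    """Return (vocab_size, vector_size) from a gzipped word2vec binary file."""
    with gzip.open(path, "rb") as f:
        # The first line is plain ASCII: "<vocab_size> <vector_size>\n"
        vocab_size, vector_size = (int(x) for x in f.readline().split())
    return vocab_size, vector_size


def read_first_vector(path):
    """Return (word, vector) for the first entry, streaming through gzip."""
    with gzip.open(path, "rb") as f:
        _, vector_size = (int(x) for x in f.readline().split())
        word = bytearray()
        while True:
            ch = f.read(1)
            if ch == b"":
                raise ValueError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # entries may be separated by a newline
                word.extend(ch)
        # Assumption: floats were written little-endian, 4 bytes each.
        raw = f.read(4 * vector_size)
        vector = struct.unpack("<%df" % vector_size, raw)
    return word.decode("utf-8"), vector
```

With gensim itself, the usual entry point for this file format is `KeyedVectors.load_word2vec_format(path, binary=True)`; per the comment above, it accepts the `.gz` file directly.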
@akosiaris The file is about 1.5 GB. We decided against storing it in the ores git repo because it would bloat the repo, and every new version we track would add to the size problem.
Yes, given the size of that file it's a wise decision not to store it directly in the ores git repo; that would have been a disaster. Shipping it as a Debian package is not really better, though, for multiple reasons: it would cause puppet issues during installation (slow fetching, network congestion), it would bloat the local filesystems as well as the Debian repository filesystem, and the effort going into the debianization is really not worth it for shipping a single file.
How about using git-fat? IIRC scap already supports it.
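The subtasks on this task refer to git-lfs rather than git-fat, but the idea is similar: the repository stores only a small text pointer, and the large file lives elsewhere. As a rough illustration of why that avoids bloating the repo, the sketch below builds the pointer text described by the git-lfs v1 pointer spec (version line, SHA-256 oid, size in bytes). The helper name `lfs_pointer` is hypothetical, and this is not what scap or the git-lfs client actually runs.

```python
import hashlib


def lfs_pointer(path, chunk_size=1 << 20):
    """Build git-lfs v1 pointer text for the file at `path`."""
    sha = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so a multi-gigabyte file never sits in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
            size += len(chunk)
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha.hexdigest()}\n"
        f"size {size}\n"
    )
```

The pointer stays at roughly 130 bytes regardless of the size of the tracked file, so each new version of the vectors adds only a few lines to the git history.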
Note that I've created T188446: Package word2vec binaries as a more specific sub-task regarding the packaging bits discussed above.