[Epic] Support word2vec for production ORES models
Closed, ResolvedPublic

Assigned To

Authored By

	Sumit
	Feb 13 2018, 5:09 PM

Description

Task to track the requirements that need to be addressed before we can have models using word2vec on ORES deployed.

Related Objects

Status	Assigned	Task
Resolved	awight	T187217 [Epic] Support word2vec for production ORES models
Resolved	Sumit	T188445 Implement word2vec featurevector in revscoring
Resolved	awight	T188446 Package word2vec binaries
Resolved	awight	T181678 Plan migration of ORES repos to git-lfs
Resolved	• demon	T181835 Add gitlab to proxies/whitelist for mirroring to phabricator
Resolved	• mmodell	T180627 Support git-lfs in scap
Resolved	awight	T180628 Install git-lfs client (at least on scap targets & masters)
Declined	None	T182085 Connect Phabricator to swift for storage of git-lfs and file uploads.
Resolved	• mmodell	T192042 Create gerrit mirrors for all github-based ORES repos
Resolved	fgiunchedi	T192124 Deploy Scap 3.8.0 to production
Resolved	Halfak	T188447 Update ORES wheels for new revscoring requirements
Resolved	Halfak	T188755 Update ORES requirements to support revscoring 2.2.0
Resolved	Halfak	T188775 Re-train models with revscoring 2.2.0

Event Timeline

Sumit created this task. Feb 13 2018, 5:09 PM

Restricted Application added a subscriber: Aklapper. Feb 13 2018, 5:09 PM

awight renamed this task from Support for word2vec on ORES deployment to Support word2vec for production ORES models. Feb 26 2018, 3:56 PM

awight updated the task description.

Working on the Debian packaging here: https://phabricator.wikimedia.org/source/word2vec/

@Sumit Is the gensim package able to read the gzipped file, or should we decompress during installation?

Here's what the package will look like installed, for now:

dpkg -L word2vec
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/word2vec
/usr/share/doc/word2vec/changelog.Debian.gz
/usr/share/doc/word2vec/copyright
/usr/share/word2vec
/usr/share/word2vec/GoogleNews-vectors-negative300.bin.gz

awight claimed this task. Feb 26 2018, 5:07 PM

awight moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.

@akosiaris Feel like reviewing this Debian packaging for sanity?

Why do we even need to create a Debian package to ship that single file? I don't think it's worth the trouble. Can't we just ship it in the ORES repos? How big is it, btw? https://phabricator.wikimedia.org/source/word2vec/browse/master/GoogleNews-vectors-negative300.bin.gz times out and I can't download it.

In T187217#4001947, @awight wrote:

Working on the Debian packaging here: https://phabricator.wikimedia.org/source/word2vec/

@Sumit Is the gensim package able to read the gzipped file, or should we decompress during installation?

@awight gensim can read the gzipped file directly.

In T187217#4005384, @akosiaris wrote:

Why do we even need to create a Debian package to ship that single file? I don't think it's worth the trouble. Can't we just ship it in the ORES repos? How big is it, btw?

@akosiaris the file is about 1.5 GB. We decided against storing it in the ORES git repo because it would bloat the repo, and each new version tracked in git would add to that size.
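For reference, the gzip question above can be illustrated without gensim. This is a minimal stdlib-only sketch, assuming the usual word2vec binary layout (an ASCII header line with vocabulary size and dimension, then each word followed by a space and little-endian float32 values). The function names and the tiny sample file are hypothetical and not part of ORES, revscoring, or gensim; with gensim itself the load would presumably be `KeyedVectors.load_word2vec_format(path, binary=True)`.

```python
import gzip
import os
import struct
import tempfile


def write_word2vec_bin_gz(path, vectors):
    """Write {word: [floats]} in word2vec binary layout, gzip-compressed.

    Hypothetical helper, used here only to produce a tiny test file.
    """
    dim = len(next(iter(vectors.values())))
    with gzip.open(path, "wb") as f:
        # Header: "<vocab_size> <dimension>\n" in ASCII.
        f.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
        for word, vec in vectors.items():
            f.write(word.encode("utf-8") + b" ")
            f.write(struct.pack(f"<{dim}f", *vec))  # little-endian float32
            f.write(b"\n")


def read_word2vec_bin_gz(path):
    """Stream a gzipped word2vec binary file without decompressing to disk."""
    vectors = {}
    with gzip.open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            word = bytearray()
            while True:
                ch = f.read(1)
                if ch == b"":
                    raise EOFError("file ended in the middle of a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline left after the previous vector
                    word.extend(ch)
            raw = f.read(4 * dim)
            vectors[word.decode("utf-8")] = list(struct.unpack(f"<{dim}f", raw))
    return vectors


# Values chosen to be exactly representable as float32.
sample = {"wiki": [0.5, -1.25, 2.0], "edit": [1.0, 0.0, -0.75]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny-vectors.bin.gz")
    write_word2vec_bin_gz(path, sample)
    loaded = read_word2vec_bin_gz(path)
print(loaded == sample)
```

Reading through `gzip.open` keeps the file compressed on disk, which is the trade-off being asked about: no decompression step at install time, at the cost of decompressing on every load.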

In T187217#4005825, @Sumit wrote:

In T187217#4001947, @awight wrote:

Working on the Debian packaging here: https://phabricator.wikimedia.org/source/word2vec/

@Sumit Is the gensim package able to read the gzipped file, or should we decompress during installation?

@awight gensim can read the gzipped file directly.

In T187217#4005384, @akosiaris wrote:

Why do we even need to create a Debian package to ship that single file? I don't think it's worth the trouble. Can't we just ship it in the ORES repos? How big is it, btw?

@akosiaris the file is about 1.5 GB. We decided against storing it in the ORES git repo because it would bloat the repo, and each new version tracked in git would add to that size.

Yes, given the size of that file, it's a wise decision not to store it directly in the ORES git repo. It would have been a disaster to do so. Shipping it as a Debian package is not really better, though, for multiple reasons. It would cause puppet issues during installation (slow fetching, network congestion), and it would bloat the local filesystems as well as the Debian repository filesystem. On top of that, the effort of debianizing is really not worth it for shipping a single file.

How about using git-fat? IIRC scap already supports it.

Halfak updated the task description. Feb 27 2018, 9:23 PM

Note that I've created T188446: Package word2vec binaries as a more specific sub-task regarding the packaging bits discussed above.

awight renamed this task from Support word2vec for production ORES models to [Epic] Support word2vec for production ORES models. Mar 5 2018, 3:47 PM

awight removed awight as the assignee of this task.

awight added a project: Epic.

awight moved this task from Review to Non-Epic on the Machine-Learning-Team (Active Tasks) board.

awight mentioned this in T188446: Package word2vec binaries. Mar 5 2018, 4:01 PM

• dcausse subscribed. Mar 12 2018, 4:17 PM

awight changed the status of subtask T188446: Package word2vec binaries from Open to Stalled. Mar 19 2018, 7:15 PM

awight closed subtask T188445: Implement word2vec featurevector in revscoring as Resolved. May 2 2018, 6:48 PM

awight closed subtask T188755: Update ORES requirements to support revscoring 2.2.0 as Resolved.

awight closed subtask T188446: Package word2vec binaries as Resolved. Jun 4 2018, 8:27 PM

awight closed subtask T188775: Re-train models with revscoring 2.2.0 as Resolved. Jun 7 2018, 1:38 PM

awight closed this task as Resolved. Jun 12 2018, 2:59 PM

awight claimed this task.

awight closed subtask T188447: Update ORES wheels for new revscoring requirements as Resolved.

akosiaris mentioned this in T288198: Pushes to docker-registry fail for images with compressed layers of size >1GB. Jul 24 2023, 9:43 AM
