Page MenuHomePhabricator

Package word2vec binaries
Closed, ResolvedPublic

Description

The word2vec data is a 1.5GB file that we'll need to deploy to all ORES compute nodes. The timeline for using files of this size:

  • We want the word2vec data ASAP, since it's blocking a production model.
  • We might never need to update the file, but we'll probably need it installed for years to come.
  • I expect that we'll have our own "embeddings" files of similar size, within a year.

Options for deployment:

  1. Deploy as a .deb.
  2. Deploy in the ORES git repo
    • Absolutely not, this would make our repo unusable.
  3. Deploy as its own git-lfs repo
    • Scap might be ready to handle git-lfs
    • @awight is not looking forward to being the rocket dog, i.e. the first adopter of git-lfs in scap. We don't want to block deployment waiting for this kind of infrastructure work if it's not mature.

Event Timeline

@akosiaris I started outlining our options here, please chime in when you can.

So basically we are talking about 3 choices, all of them undesirable to at least one person, if not more.

Technically speaking, and assuming a bug-free world, the best one seems to be the git-lfs/git-fat way. Is it viable? I'll admit I have no experience with it yet. @mmodell, @thcipriani, would it be a viable alternative?

git-lfs status:

SCAP + git-fat should be working, and it's somewhat better tested, so there's always that.

@awight is not looking forward to being the rocket dog, i.e. the first adopter of git-lfs in scap. We don't want to block deployment waiting for this kind of infrastructure work if it's not mature.

Unfortunately, rocket dog is exactly what we are talking about here.

Well... @akosiaris, what would you think about using the word2vec .deb provisionally, with a medium-term commitment to move to git-lfs/fat once we can prove that the deployment stack is stable? If I understand correctly, the problems with the Debian packaging strategy are temporary: a one-time hit to Puppet, and 2x disk usage on each machine, which we would recover when we stop using the package. In this migration scenario, we'd also be duplicating 1.5GB x 18 (27GB) of intranet traffic, once for the Debian deployment and again for the git-lfs/fat deployment.

I'm also happy to chat about ways to make the rocket dog more palatable, of course.

Temporary is the new permanent, and from further reading I understand that updates to that data would be rather infrequent, leaving one less incentive in the long run to migrate away from the temporary solution. I'd rather we avoided that approach and went with a better one from the start.

Also note that aside from the problems I already mentioned, 1.5GB Debian packages are not by any means the "norm". In fact, running some statistics on the Debian package database, the mean is 1.3MB with a standard deviation of 12MB. The median is barely 68KB. There are just 9 packages > 500MB (out of a total of ~48K packages) and no packages above 700MB (and no, we don't have that package installed; it's http://www.redeclipse.net, which is definitely not something you expect on a server). Simply put, we may very well be pushing the solution to its limits and meeting problems in the process that end up consuming a significant amount of time.

I don't think we have any other project having those disk space requirements, so it's kind of expected that AI related software would be the pioneer in this.

Alright, thanks for talking this through. I like the idea of keeping these files in git-lfs, and I'm sure we can iron out any scap glitches. ASAP would be nice, but honestly, if we can get the binary deployed within the month we'll be happy.

How should we get started? I could create a git-lfs repo in GitLab and mirror it into Diffusion, if that sounds reasonable. A slight catch: it isn't logical to hang the repo off of mediawiki/services/ores/deploy as a submodule, at least not yet, because it isn't a dependency of anything in that tree. Eventually, we'll deploy the drafttopic model, which depends on word2vec, but I don't want to wait for that: we've already identified word2vec deployment as a blocking dependency, so it should be made ready earlier.

Is it easy to deploy this repo independently? Does it make sense to write scap configuration in this repo, etc.?

Another random thought: we can build our repo incrementally, to test scap+git-lfs at each step.

  1. Create a stub repo with scap config.
  2. Convert to git-lfs and add a small file.
  3. Add the large word2vec file.

Current status of git-lfs in Phabricator: pending. Phabricator has the support; however, we need a storage back-end for the files. Using Phabricator's database-backed storage doesn't seem wise given the size of the files to be stored there. We'd either need to use Gerrit only for the time being, or wait for me to get Swift up and running.

I'd be interested in creating a gerrit-only repo with this type of asset in it. We can always move it to phab later if we want.

We'll need to ask @demon what is needed to get a git-lfs repo on gerrit. I think he's out sick today, hopefully he's feeling better tomorrow.

@mmodell we support git-lfs in gerrit now. :)

Chad installed the plugin a few months ago, and i know the setup to enable repos.

Which repo do we want to enable this on?

Also how much storage?

@Paladox: I thought access to git-lfs was limited by something in gerrit. (edit) I think you confirmed my suspicion: it has to be enabled per-repo.

@mmodell Yeah, we add the repo in lfs.config in All-Projects. It's not automatic.
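For reference, the gerrit lfs plugin reads per-project settings from an lfs.config file on the refs/meta/config branch of All-Projects. A minimal sketch, assuming the stock plugin syntax (the project name is from this task; the size value is illustrative):

```ini
# lfs.config on refs/meta/config of All-Projects (gerrit lfs plugin)
[lfs "research/ores/wheels"]
    enabled = true
    maxObjectSize = 50g
```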

https://gerrit.wikimedia.org/r/#/projects/research/ores/wheels,dashboards/default is where we want it right now.

Would 50GB be too high of a ceiling? That's about 10x what we need right now, and we will probably double our current usage in the next 3-6 months, so 10x seems like a good long-term ceiling.

Change 416756 had a related patch set uploaded (by Paladox; owner: Paladox):
[All-Projects@refs/meta/config] Increase lfs resources on research/ores/wheels to 50gb

https://gerrit.wikimedia.org/r/416756

Great. Thank you. It looks like T180628: Install git-lfs client (at least on scap targets & masters) is our last blocker until we can start experimenting with this.

@Paladox Apologies, we discussed in IRC and we'd like to change the plan slightly. Please disable git-lfs on the wheels repo, and we'll create a new repo to experiment in.

Change 416758 had a related patch set uploaded (by Paladox; owner: Paladox):
[All-Projects@refs/meta/config] Disable lfs on research/ores/wheels

https://gerrit.wikimedia.org/r/416758

Change 416756 abandoned by Paladox:
Increase lfs resources on research/ores/wheels to 50gb

https://gerrit.wikimedia.org/r/416756

Change 416758 merged by 20after4:
[All-Projects@refs/meta/config] Disable lfs on research/ores/wheels

https://gerrit.wikimedia.org/r/416758

Just following up: 50GB is fine for now, sure. We've got several TB of free space :)

D1000 bumps the scap version to 3.7.7 and adds git-lfs support... we still need git-lfs packages on all relevant servers.

I see we have git-lfs on ores*.eqiad.wmnet, so we're almost ready to give this a try.

What I'm missing is the scoring/ores/assets repo (request submitted for its creation), and @akosiaris I would like to deploy this repo onto ores*, separately from the current ores deployment so that we can test LFS in isolation.
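Deploying the assets repo independently would mostly be a matter of giving it its own scap config. A hypothetical sketch, assuming the standard scap3 layout (the ssh user and target file name are made up, and the git_binary_manager key assumes the git-lfs support added in scap 3.7.7):

```ini
# scap/scap.cfg in a standalone scoring/ores/assets deploy repo
# (illustrative values throughout)
[global]
git_repo: scoring/ores/assets
ssh_user: deploy-service
dsh_targets: ores-targets
git_binary_manager: git-lfs
```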

Nice. Per T180628#4045978 let's do some testing first in labs/beta before we start testing directly in production. We don't want to cause an outage testing untested code directly in production

Good thing to keep track of, here's my update:

  • Tracking tasks specific to git-lfs under T180627: Support git-lfs in scap.
  • Branched mediawiki/services/ores/deploy as git-lfs, and I've added scoring/ores/assets as a submodule. In the assets repo, I'm experimenting with adding LFS on a branch.
  • Only deploying these branches to deployment-ores01.
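The branch-and-submodule layout above can be sketched with plain git. The paths here are local stand-ins for mediawiki/services/ores/deploy and scoring/ores/assets, and the actual LFS conversion (done on a branch of the assets repo) is omitted.

```shell
set -e
work=$(mktemp -d)

# Stub assets repo, with an experimental branch for the LFS conversion
git init -q "$work/assets"
(cd "$work/assets" \
  && git config user.email "test@example.org" \
  && git config user.name "test" \
  && echo "placeholder for word2vec data" > README \
  && git add README \
  && git commit -qm "Stub assets repo" \
  && git checkout -qb lfs-experiment)   # LFS tracking would be trialled here

# Stub deploy repo, with assets hung off it as a submodule
git init -q "$work/deploy"
cd "$work/deploy"
git config user.email "test@example.org"
git config user.name "test"
echo "deploy repo" > README
git add README
git commit -qm "Stub deploy repo"
# Newer git restricts file:// submodules; allow it for this local sketch.
git -c protocol.file.allow=always submodule add "$work/assets" assets
git commit -qm "Add assets submodule"
git submodule status
```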
awight renamed this task from Package word2vec binaries to [Blocked] Package word2vec binaries.Mar 19 2018, 7:15 PM
awight changed the task status from Open to Stalled.

@mmodell, @ArielGlenn, and @demon said in IRC that we need to wait for the ICU updates for stretch, and that will require waiting until April 9th (+ a few days).

To clarify:
Moving the deployment servers to stretch requires PHP 7; this means using the icu57 libraries, which are incompatible with the icu52 libraries we use everywhere now.
April 9 is set as the migration date for this; things can slip, but that's more or less what we're looking at.
After that, of course, is working out things like mwscript, foreachwiki, etc.

awight renamed this task from [Blocked] Package word2vec binaries to Package word2vec binaries.Jun 4 2018, 8:27 PM
awight closed this task as Resolved.