Provide large disk space to WikiBrain for memory-mapped file
Open, Normal, Public

Description

WikiBrain is a great technology that brings a bunch of AI research back to Wikipedia. One of the key technologies to having things like semantic-relatedness on-demand is being able to have very large memory-mapped files. These are currently running in the 200GB range for English Wikipedia. Because the largest labs VM can only support up to 120GB of disk space, WikiBrain is currently running on University of Minnesota hardware. We should have this awesome resource on labs.
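As context for why a single large contiguous disk allocation matters, here is a minimal sketch of the memory-mapping technique (Python for illustration; WikiBrain itself is Java, and the file layout here is a made-up stand-in, not WikiBrain's actual format). The OS pages data in from disk on demand, so a 200GB index can be queried with modest RAM, but the whole file must fit on one filesystem:

```python
import mmap
import os
import struct

# Hypothetical illustration: a large on-disk array of 8-byte floats
# (e.g. precomputed relatedness scores), read on demand via mmap.
PATH = "scores.bin"
N = 1_000_000  # the real WikiBrain index is in the 200GB range

# Build a small stand-in file (seeking before the write keeps it cheap).
with open(PATH, "wb") as f:
    f.seek(N * 8 - 8)
    f.write(struct.pack("<d", 0.75))

with open(PATH, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Random access touches only the pages it needs; the file is
    # never loaded into RAM wholesale.
    last = struct.unpack_from("<d", mm, (N - 1) * 8)[0]
    print(last)  # 0.75
    mm.close()

os.remove(PATH)
```

The key point for provisioning: the page cache makes reads cheap, but the backing file itself cannot be split across small disks, hence the request for one large allocation.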

Project info: Create a VM to host the Wikibrain web service and visualization
Project Name: wikibrain-api
Purpose: Provide a hosted algorithmic API
Wikitech username of requestor: Shilad Sen
Brief description: (See above)
How soon you are hoping this can be fulfilled: ASAP
HW Specification: 4 cores / 24GB RAM / 300GB disk

Relevant to T96950 and T155853

See also https://meta.wikimedia.org/wiki/Grants:IEG/WikiBrainTools and http://atlasify.northwestern.edu/ and http://cartograph.info

Halfak created this task. Mar 27 2017, 7:54 PM
Restricted Application added a project: Cloud-Services. Mar 27 2017, 7:54 PM
Restricted Application added a subscriber: Aklapper.
bd808 triaged this task as Normal priority. Mar 27 2017, 9:24 PM
bd808 added a subscriber: bd808.

Custom image sizes are certainly possible in our OpenStack deployment.

Can you give more concrete sizing needs? Would the project need terabytes of storage? Can we start small and scale up? What will the community get back for investing in this project?

Halfak added a comment. Apr 4 2017, 1:40 PM

@Shilad can probably answer better than I can on most points.

It seems that, right now, at least 200GB is essential. I expect that might double over 5-10 years for a wiki. So maybe 500GB per VM would work OK.

I'd like to use semantic relatedness for recommending articles to editors/readers that align with their interests and current reading session. I think it will also be interesting for many different analyses of editor behavior. E.g., tracking the relatedness of the articles an editor works on might help us better understand how interests change over time. I'd also like to see the tools that these researchers are developing on labs so that we can have a chance to maintain them long after the researchers move on from their current work.

Halfak added a comment. Apr 4 2017, 1:45 PM

FYI: https://en.wikipedia.org/w/index.php?diff=773800647

Hey The Transhumanist! I think WikiBrain is a great project. I really value that the researchers have been working hard to get WikiBrain running on WMF infrastructure so that it will be easier for Wiki developers to work with. Regretfully, we're currently blocked on getting a decent allocation of disk space. It turns out that WikiBrain needs a single large hard drive allocation in order to store and work from a big index file. See Phab:T161554. The task is pretty recent, but I've been trying to work out this hard drive allocation problem for years for this project and some related ones.

Halfak updated the task description. Apr 19 2017, 7:56 PM
Halfak renamed this task from "Provide large disk space to wikibrain for memory-mapped file" to "Provide large disk space to WikiBrain for memory-mapped file". Apr 19 2017, 8:14 PM

Just to follow up on this. Aaron's estimates are pretty accurate. The disk cached data structures require about 200GB for larger language editions right now. We would likely expand to 500GB over time (or if we require "more advanced" WikiBrain features). Is this possible?

Thanks for your help!

> It seems that, right now, at least 200GB is essential. I expect that might double over 5-10 years for a wiki. So maybe 500GB per VM would work OK.

VMs should be much more ephemeral than a 5-10 year lifetime. Even our physical hardware is only expected to have a 5 year lifespan.

> Just to follow up on this. Aaron's estimates are pretty accurate. The disk cached data structures require about 200GB for larger language editions right now. We would likely expand to 500GB over time (or if we require "more advanced" WikiBrain features). Is this possible?

It is not clear to me how many VMs you would need. Are we talking about one VM with >200GB of storage, or 10, or 100?

I think Aaron was saying that although 200GB would probably work right now, it wouldn't hold Wikipedia for very long. 500GB would definitely last for 5 years. Somewhere in between those sizes would work for a few years. Two of the large storage VMs would be plenty initially.

Thanks!

Andrew added a subscriber: Andrew. Apr 21 2017, 3:13 PM

So, this question relates to both storage needs and the appropriateness of Labs use: Is this giant storage use something persistent and valuable, or more like a scratch-pad? That is, if we create a 250GB instance today and then in 2019 you need a 400GB instance to handle growth, can you just throw out the old instance and make a new one? Or is the actual storage on the old instance valuable and hard to reproduce, such that you'd have to copy or save the file somehow?

Because I hate it when people store valuable/unrecoverable data on labs :)

Good questions! The big files are statistical models. They take a while to build (a day or two), but they can be easily recreated. I think your suggestion of swapping the VMs over time seems reasonable. My only thought is that a little more wiggle room (perhaps 300GB) would substantially reduce the rate at which we had to turn over the images.

Thanks!

OK, sounds good. I'll try to do some capacity assessment and bring this up at our next meeting.

Be warned, though, that this may have to wait until we get some more hardware racked -- we've ordered a bunch of new virtualization capacity but the world SSD economy is broken and it's unclear when the ones we need will be available.

Do you know what kind of CPU/RAM you'll need?

Great! 24GB of memory and 4 cores would be great if that works for you.

Thank you again!

chasemp changed the task status from Open to Stalled. May 2 2017, 3:27 PM

We have a new virt server set up that should be able to handle this... I might need to create your VMs by hand but I'll try to follow up this week.

bd808 changed the task status from Stalled to Open. Jul 11 2017, 5:28 PM

@Halfak we have the new labvirt hosts online now that this was blocked on!

Could you please update the main summary here to provide the requested data and format for a new project? This helps us keep track of things a little better. Please also include the description of the custom image you need. It looks like we may have settled on 4 cores/24GB ram/300GB disk in the discussion above.

I may need to create these VMs by hand, as I'm having trouble getting the host-aggregate feature to work properly. What would you like me to call them?

@Shilad, can you confirm?

I recommend calling the big API server something like "wikibrain-api-01" or maybe "wikibrain-enwiki-01" or something like that -- depending on whether or not a single machine will be needed for all of enwiki.

There should probably be separate VMs for the various visualization experiments that will use the API/embeddings.

For the separate VMs, I'm imagining something like "wikibrain-atlasify-01" or something like that.

Thanks @Halfak and @Andrew! This is exciting!

wikibrain-host-01 sounds great for the server itself, and the VMs could be wikibrain-en-01 and wikibrain-viz-01.

I'll update the summary with the data & format info you asked for right now.

Shilad updated the task description. Jul 12 2017, 5:40 AM
Shilad updated the task description.

Also, I'll probably be using Docker images (we have a WikiBrain docker image). I presume that it's better to run the Docker image in a VM rather than on the host, but please let me know if that's not correct.
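For the Docker-in-a-VM setup described above, one common pattern is to bind-mount the large data partition into the container so the memory-mapped index lives on the big disk rather than the VM's root filesystem. A sketch only: the image name, mount paths, and port are illustrative assumptions, not the project's actual names:

```python
import shlex

# Hypothetical `docker run` invocation; "wikibrain/wikibrain" and the
# /srv/wikibrain mount point are illustrative assumptions.
cmd = [
    "docker", "run", "-d",
    "--name", "wikibrain-api",
    # Bind-mount the big partition so the ~200GB index (and its
    # memory-mapped pages) sit on the 300GB disk, not the root fs.
    "-v", "/srv/wikibrain:/data",
    "-p", "8080:8080",
    "wikibrain/wikibrain",
]
print(shlex.join(cmd))
```

Running the container inside the VM (rather than on the virt host) also keeps the disk quota and resource accounting scoped to the project, which matches how labs instances are normally managed.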

Andrew added a comment. Mon, Aug 7, 3:29 PM

OK, after a quick chat with Aaron, I've created two big VMs for you:

wikibrain-embeddings-01
wikibrain-embeddings-02

I also adjusted your RAM quota so that you can create two more smaller VMs (either small or medium, depending on your needs). Please use debian-stretch for all your instances.

Note that by default most of the disk space is not mounted for a new VM. To partition that space you'll need to apply something like the role::labs::lvm::srv puppet class.
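Once a class like that is applied, a quick sanity check helps before building the ~200GB index. A sketch, assuming the extra space ends up mounted at /srv (an assumption based on the puppet class name above):

```python
import shutil

def has_space(path="/srv", needed_gb=200):
    """Return True if `path` has at least `needed_gb` GB free.

    The /srv default is an assumption from the role::labs::lvm::srv
    class name; adjust if the partition is mounted elsewhere.
    """
    total, used, free = shutil.disk_usage(path)
    return free >= needed_gb * 1024**3

# Example against the root filesystem, which always exists:
print(shutil.disk_usage("/").total > 0)  # True
```

If the check fails right after instance creation, the likely cause is exactly the default noted above: the extra disk exists but has not been partitioned and mounted yet.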