Provide large disk space to WikiBrain for memory-mapped file
Status: Stalled · Priority: Normal · Visibility: Public

Description

WikiBrain is a great technology that brings a bunch of AI research back to Wikipedia. One of the key technologies behind features like on-demand semantic relatedness is the ability to work with very large memory-mapped files. These are currently in the 200GB range for English Wikipedia. Because the largest Labs VM only supports up to 120GB of disk space, WikiBrain is currently running on University of Minnesota hardware. We should have this awesome resource on Labs.
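For context on why a single large local volume matters, here is a minimal sketch (illustrative only, not WikiBrain's actual code, and with a hypothetical file path) of how a very large index file can be memory-mapped in Java. Each mapped region is capped at 2GB, so a ~200GB index is mapped as a series of regions over one contiguous file; the OS then pages data in on demand, which keeps RAM needs modest but requires the whole file to fit on one disk.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch (not WikiBrain's code): memory-map a very large index file
 * in read-only chunks. Each MappedByteBuffer covers at most ~2GB, so a
 * ~200GB file is mapped as a list of regions. The OS pages data in and out
 * on demand, so the file never has to fit in RAM, but it does have to fit
 * on a single local volume.
 */
public class LargeIndexMapper {
    private static final long CHUNK = 1L << 30; // 1GB per mapped region

    public static List<MappedByteBuffer> mapReadOnly(Path file) throws IOException {
        List<MappedByteBuffer> regions = new ArrayList<>();
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += CHUNK) {
                long len = Math.min(CHUNK, size - pos);
                // Mappings stay valid after the channel is closed.
                regions.add(channel.map(FileChannel.MapMode.READ_ONLY, pos, len));
            }
        }
        return regions;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path; the real index lives wherever WikiBrain is configured to put it.
        List<MappedByteBuffer> index = mapReadOnly(Paths.get("/srv/wikibrain/sr-index.dat"));
        System.out.println("Mapped " + index.size() + " regions");
    }
}
```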

Relevant to T96950 and T155853

See also https://meta.wikimedia.org/wiki/Grants:IEG/WikiBrainTools and http://atlasify.northwestern.edu/

Halfak created this task. Mar 27 2017, 7:54 PM
Restricted Application added a project: Labs. Mar 27 2017, 7:54 PM
Restricted Application added a subscriber: Aklapper.
bd808 triaged this task as Normal priority. Mar 27 2017, 9:24 PM
bd808 added a subscriber: bd808.

Custom image sizes are certainly possible in our OpenStack deployment.

Can you give more concrete sizing needs? Would the project need terabytes of storage? Can we start small and scale up? What will the community get back for investing in this project?

Halfak added a comment. Apr 4 2017, 1:40 PM

@Shilad can probably answer better than I can on most points.

It seems that, right now, at least 200GB is essential. I expect that might double over 5-10 years for a given wiki. So maybe 500GB per VM would work OK.

I'd like to use semantic relatedness for recommending articles to editors and readers that align with their interests and current reading session. I think it will also be interesting for many different analyses of editor behavior. For example, tracking the relatedness of the articles an editor works on might help us better understand how interests change over time. I'd also like to see the tools these researchers are developing hosted on Labs so that we have a chance to maintain them long after the researchers move on from their current work.

Halfak added a comment. Apr 4 2017, 1:45 PM

FYI: https://en.wikipedia.org/w/index.php?diff=773800647

Hey The Transhumanist! I think WikiBrain is a great project. I really value that the researchers have been working hard to get WikiBrain running on WMF infrastructure so that it will be easier for Wiki developers to work with. Regretfully, we're currently blocked on getting a decent allocation of disk space. It turns out that WikiBrain needs a single large disk allocation in order to store and use a big index file. See Phab:T161554. The task is pretty recent, but I've been trying to work out this disk allocation problem for this project and some related ones for years.

Halfak updated the task description. Apr 19 2017, 7:56 PM
Halfak renamed this task from "Provide large disk space to wikibrain for memory-mapped file" to "Provide large disk space to WikiBrain for memory-mapped file". Apr 19 2017, 8:14 PM

Just to follow up on this: Aaron's estimates are pretty accurate. The disk-cached data structures require about 200GB for larger language editions right now. We would likely expand to 500GB over time (or if we require "more advanced" WikiBrain features). Is this possible?

Thanks for your help!

bd808 added a comment. Apr 20 2017, 5:11 AM

> It seems that, right now, at least 200GB is essential. I expect that might double over 5-10 years for a given wiki. So maybe 500GB per VM would work OK.

VMs should be much more ephemeral than a 5-10 year lifetime. Even our physical hardware is only expected to have a 5 year lifespan.

> Just to follow up on this: Aaron's estimates are pretty accurate. The disk-cached data structures require about 200GB for larger language editions right now. We would likely expand to 500GB over time (or if we require "more advanced" WikiBrain features). Is this possible?

It's not clear to me how many VMs you would need. Are we talking about one VM with >200GB of storage, or 10, or 100?

I think Aaron was saying that although 200GB would probably work right now, it wouldn't hold Wikipedia for very long. 500GB would definitely last for 5 years. Somewhere in between those sizes would work for a few years. Two of the large-storage VMs would be plenty initially.

Thanks!

Andrew added a subscriber: Andrew. Apr 21 2017, 3:13 PM

So, this question relates to both storage needs and the appropriateness of Labs use: is this giant storage use something persistent and valuable, or more like a scratch pad? That is, if we create a 250GB instance today and then in 2019 you need a 400GB instance to handle growth, can you just throw out the old instance and make a new one? Or is the data on the old instance valuable and hard to reproduce, such that you'd have to copy or save the file somehow?

Because I hate it when people store valuable/unrecoverable data on labs :)

Good questions! The big files are statistical models, so they take a while to build (a day or two), but they can easily be recreated. I think your suggestion of swapping out the VMs over time seems reasonable. My only thought is that a little more wiggle room... perhaps 300GB... would substantially reduce the rate at which we had to turn over the images.

Thanks!

OK, sounds good. I'll try to do some capacity assessment and bring this up at our next meeting.

Be warned, though, that this may have to wait until we get some more hardware racked. We've ordered a bunch of new virtualization capacity, but the global SSD market is in bad shape and it's unclear when the drives we need will be available.

Do you know what kind of CPU/RAM you'll need?

Great! 24GB of memory and 4 cores would be plenty, if that works for you.

Thank you again!

chasemp changed the task status from Open to Stalled. May 2 2017, 3:27 PM