WikiBrain is a great technology that brings a bunch of AI research back to Wikipedia. One of the key requirements for providing things like semantic relatedness on-demand is support for very large memory-mapped files. These are currently running in the 200GB range for English Wikipedia. Because the largest labs VM can only support up to 120GB of disk space, WikiBrain is currently running on University of Minnesota hardware. We should have this awesome resource on labs.
Related tasks:
- Open | T76375 New Labs project requests (tracking)
- Open | T161554 Provide large disk space to WikiBrain for memory-mapped file
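For context on why a single large disk allocation matters here: in Java (WikiBrain's language), a single `MappedByteBuffer` tops out at 2GB because buffer offsets are `int`s, so a ~200GB index file has to be mapped as a series of chunks over one contiguous file. Below is a minimal, hypothetical sketch of that chunking pattern, not WikiBrain's actual code; the class name and 1GB chunk size are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: memory-map one large file in 1GB chunks,
// because a single MappedByteBuffer cannot exceed 2GB.
public class MmapSketch {
    static final long CHUNK = 1L << 30; // assumed chunk size: 1GB per mapping

    // Map the whole file read-only as an array of buffers, one per chunk.
    static MappedByteBuffer[] mapFile(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            int n = (int) ((size + CHUNK - 1) / CHUNK); // number of chunks, rounded up
            MappedByteBuffer[] bufs = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long off = (long) i * CHUNK;
                bufs[i] = ch.map(FileChannel.MapMode.READ_ONLY, off,
                                 Math.min(CHUNK, size - off));
            }
            return bufs; // mappings stay valid after the channel is closed
        }
    }

    // Translate a global file offset into (chunk, offset-within-chunk).
    static byte readByte(MappedByteBuffer[] bufs, long pos) {
        return bufs[(int) (pos / CHUNK)].get((int) (pos % CHUNK));
    }

    public static void main(String[] args) throws IOException {
        // Tiny demo file; a real index would be hundreds of GB.
        Path tmp = Files.createTempFile("mmap-demo", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[]{1, 2, 3, 4});
        MappedByteBuffer[] bufs = mapFile(tmp);
        System.out.println(readByte(bufs, 2)); // prints 3
    }
}
```

The relevant point for this request is that the chunks must all back one contiguous file on a single filesystem, which is why the project needs one large volume rather than several small ones.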
Custom image sizes are certainly possible in our OpenStack deployment.
Can you give more concrete sizing needs? Would the project need terabytes of storage? Can we start small and scale up? What will the community get back for investing in this project?
@Shilad can probably answer better than I can on most points.
It seems that, right now, at least 200GB is essential. I expect that might double over 5-10 years for a wiki. So maybe 500GB per VM would work OK.
I'd like to use semantic relatedness for recommending articles to editors/readers that align with their interests and current reading session. I think it will also be interesting for many different analyses of editor behavior. E.g., tracking the relatedness of the articles an editor works on might help us understand better how interests change over time. I'd also like to see the tools that these researchers are developing on labs so that we have a chance to maintain them long after the researchers move on from their current work.
Hey The Transhumanist! I think WikiBrain is a great project. I really value that the researchers have been working hard to get WikiBrain working on WMF infrastructure so that it will be easier for Wiki developers to work with. Regretfully, we're currently blocked on getting a decent allocation of disk space. It turns out that WikiBrain needs a single large hard drive allocation in order to store and use a big index file to work from. See Phab:T161554. The task is pretty recent, but I've been trying to work out this hard drive allocation problem for years for this project and some related ones.
Just to follow up on this. Aaron's estimates are pretty accurate. The disk-cached data structures require about 200GB for larger language editions right now. We would likely expand to 500GB over time (or if we require "more advanced" WikiBrain features). Is this possible?
Thanks for your help!
VMs should be much more ephemeral than a 5-10 year lifetime. Even our physical hardware is only expected to have a 5 year lifespan.
It is not clear to me what the number of VMs that you would need is. Are we talking about one VM with >200GB of storage or 10 or 100?
I think Aaron was saying that although 200GB would probably work right now, it wouldn't hold Wikipedia for very long. 500GB would definitely last for 5 years. Somewhere in between those sizes would work for a few years. Two of the large storage VMs would be plenty initially.
So, this question relates to both storage needs and also the appropriateness of Labs use: Is this giant storage use something persistent and valuable, or more like a scratch-pad? That is, if we create a 250GB instance today and then in 2019 you need a 400GB instance to handle growth, can you just throw out the old instance and make a new one? Or is the actual storage on the old instance valuable and hard to reproduce such that you'd have to copy or save the file somehow?
Because I hate it when people store valuable/unrecoverable data on labs :)
Good questions! The big files are statistical models. So they take a while to build (a day or two), but they can be easily recreated. I think your suggestion of swapping the VMs over time seems reasonable. My only thought is that if we could have a little more wiggle room... perhaps 300GB... that would substantially reduce the rate at which we had to turn over the images.
OK, sounds good. I'll try to do some capacity assessment and bring this up at our next meeting.
Be warned, though, that this may have to wait until we get some more hardware racked -- we've ordered a bunch of new virtualization capacity but the world SSD economy is broken and it's unclear when the ones we need will be available.
Do you know what kind of CPU/RAM you'll need?