
Request creation of wikiwho VPS project
Closed, ResolvedPublic


Project Name: wikiwho (to replace the current setup, which is part of the commtech project)

Wikitech Usernames of requestors: MusikAnimal, Dmaza, HMonroy, Samwilson, Daimona Eaytoy

Purpose: To bring the WikiWho article attribution service in-house. WikiWho is used by XTools, Who-Wrote-That, and the Education-Program-Dashboard, among other applications, as well as by researchers.

Brief description: This is the first step of T288840: Migrate WikiWho service to VPS. Longer-term, we might try to get WikiWho into production, but for now we'd like to mimic the current external setup so we can keep WikiWho running without interruption. T288840 goes into more detail with regard to the stack, but in short, we need something very big!

The WikiWho team has said their current setup involves a single server with 24 CPU cores and 122 GB RAM. Additionally, there are three mounted disks, which I assume can be Cinder volumes:

  • Database: 4 TB Postgres, partitioned
    • The actual space currently used is only 3.2 TB. I put 4 TB to give it room to grow.
    • This stores editor persistence information, which WikiWho maintainers said could be omitted if we don't care about it. This is not to my knowledge needed by any WMF or WikiEdu product, but other consumers of WikiWho may be relying on it.
    • I believe this db also stores credentials for API access, which we will need, but that will only require a very small amount of space.
  • Python Pickle disk: 5 TB
    • This stores one pickle file for each article. English Wikipedia by itself consumes about 2.5 TB, but the other four languages currently supported aren't nearly as big: 541 GB (German), 397 GB (Spanish), 66 GB (Turkish), 25 GB (Basque).
    • If necessary we likely can shave this down to 4 TB, but 5 would give us room to add more languages and allow the existing ones to grow.
  • Revision dumps: 6 TB
    • I assume this is only needed temporarily when first importing a new wiki, since the attribution data all lives in the pickle files. After the initial import, the system reads EventStreams and appends to the pickle files as new revisions are created. So we probably only need a disk the size of English Wikipedia's dump, uncompressed.

How soon you are hoping this can be fulfilled: Sometime within a month or two (October-November 2021), ideally, to give us enough breathing room before the WikiWho service (probably) shuts down in early 2022.

We realize we're requesting an exceptionally large amount of quota. It may be that we don't even have the hardware to accommodate this right now, or that VPS isn't the best home for this service, even in the short term. So I guess this task is more about getting the conversation started. In the meantime we're going to try to get a rough headcount of all the stakeholders of WikiWho, as well as talk with WMF management, after which we'll have a better idea of whether this amount of storage is really justified. For now we'd like to hear what Cloud Services can do for us, if anything. Thank you for your time!

Event Timeline

@nskaggs I spoke with the WikiWho maintainers and they confirmed almost all of the hardware is owned by Gesis, so it's difficult to estimate what this would cost from a cloud hosting provider. They also confirmed that giving us an "account", so to speak, on their current hardware is not possible. The good news is some of the info I was given was outdated or included disk space that we don't need. The Postgres DB for WikiWho is only 3.2 TB compared to the 14 TB I originally wrote (!!!). Again, I believe that specifically can be skipped since it's for editor persistence data, which we're considering "optional" for now. I also confirmed the 122 GB figure was the RAM. The server code itself I doubt consumes much disk space. I'm waiting to hear back on the revision dump disk too, and will update this task accordingly once I know more.

@MusikAnimal thanks for getting those answers! I see you updated the description.

WMCS would love to help but it is a large resource ask. I can't promise we'll have the resources to fulfill it. Your revised list of needs certainly makes it much easier, as storage is the primary concern.

For reference, I did a really quick estimate using an AWS tool:

One more thing to note: assuming 15 TB of required space in total, with each Ceph OSD node adding about 5 TB, this could be supported with 3 additional Ceph OSD nodes.
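The node count above follows directly from the capacity figures quoted; a one-line sketch of that arithmetic (assuming each additional OSD node contributes ~5 TB of usable space, as stated):

```python
import math

required_tb = 15       # total block storage requested (4 + 5 + 6)
tb_per_osd_node = 5    # approximate usable space added per Ceph OSD node

# Round up: partial nodes aren't a thing.
nodes_needed = math.ceil(required_tb / tb_per_osd_node)
print(nodes_needed)
```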

nskaggs added a subscriber: nskaggs.

This has my +1 to move forward as a project, though with no solution yet for the required storage volumes mentioned. As we spoke on IRC, this will allow you to get started while the conversation on storage needs can continue.

Created the project with the requested RAM and CPU:

dcaro@cloudcontrol1003:~$ sudo wmcs-openstack quota show wikiwho | egrep '(cores|ram)'
| cores | 21     |
| ram   | 124928 |
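For reference, OpenStack reports the RAM quota in MiB, so the figure above lines up with the 122 GB the WikiWho maintainers quoted (assuming they meant binary units):

```python
# Convert the OpenStack RAM quota (MiB) to GiB to cross-check
# against the 122 GB figure from the task description.
ram_mib = 124928
ram_gib = ram_mib / 1024
print(ram_gib)
```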

And the default storage quota until that gets sorted out; when it does, please open a new task with the quota request.

Added @MusikAnimal as admin.