
Resource allocation request for the wikicommunityhealth project
Closed, Resolved (Public)

Description

Project Name: wikicommunityhealth

Wikitech Usernames of requestors: CristianCantoro, marcmiquel, davids (sdivad)

Purpose: these resources will be allocated for a back-end server and a front-end server to be used within the scope of the project Community Health Metrics: Understanding Editor Drop-off. We will need to parse dumps and retrieve data from multiple Wikipedias, from the MediaWiki History dump, and from the WikiConv dataset.

Brief description: the project will need to retrieve, process, and analyze several datasets.

The project was created in T267162.

Here are the resources that we would need:

  • disk space: 2.1TB (2TB+100GB)
  • RAM: 80GB (64GB+16GB)
  • number of cores: 20 (16+4)

We are available for any further clarification.

Event Timeline

Thanks for the request! Can you describe how big the datasets you intend to process are? 2.1TB is a lot of space. How many instances are you looking to run, and what will they be used for?

For reference, the current project quota is:
8 instances
8 vCPU
16G RAM

Hi @nskaggs,

thanks for your question. At the moment we are processing the following datasets:

  • MediaWiki history dumps
  • Wikipedia XML dumps
  • WikiConv dataset

The analyses we perform are applied to all languages, except for WikiConv, for which we only focus on ca, en, es, and it.

We are computing several metrics from each dataset. The biggest contribution to storage, of course, comes from WikiConv, since we need to download and process it.

If I remember correctly we produce:

  • 60 GB from processing the MediaWiki History Dump for revert metrics;
  • 20 GB from processing the MediaWiki History Dump for lifecycle metrics;
  • 500 GB from processing the WikiConv dataset for discussion metrics and sentiment analysis;
  • 1 GB from processing the Wikipedia XML dump for user warnings and wikibreaks.

We also reasoned that, to keep the data updated, we should have space for two iterations of these computations plus a little margin.
We can re-evaluate our needs once we have deployed our system (in 3 months, by September 18th) and reduce the resources we consume if we find that we can make do with less.

Cristian

After looking at available resources, unfortunately we cannot grant that much extra disk space, given the size of the request. Is it possible for you to instead stream the data you are looking to process rather than download and retain copies? As I understand it, streaming datasets is possible and is done by others within the movement. However, I'm unsure of the details or whether it's applicable to your use case.

https://github.com/mediawiki-utilities/python-mwxml might be useful for processing dumps as streams. Maybe @Halfak can point to a how-to somewhere that shows how to combine this with streaming bz2 file reads as well?
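Something along these lines might be a starting point (a rough, untested sketch assuming a locally available .xml.bz2 history dump; the filename is illustrative):

```
# Rough sketch, untested: stream a compressed history dump without unpacking it.
# The filename is illustrative.
import bz2

import mwxml

with bz2.open("cawiki-latest-pages-meta-history.xml.bz2", "rb") as f:
    dump = mwxml.Dump.from_file(f)
    for page in dump:
        for revision in page:
            pass  # compute metrics revision by revision; nothing extra is kept on disk
```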

I suggest referencing https://pythonhosted.org/mwxml/map.html#mwxml.map

This function takes a list of paths to dump files and performs streaming decompression and XML parsing, allowing you to process XML dump files in parallel.

See https://github.com/wikimedia/articlequality/blob/master/articlequality/utilities/extract_labelings.py#L86 for an example usage. Essentially, you pass a function that processes an mwxml.Dump and yields whatever outputs you want, along with a list of dump file locations. You'll get back a generator that collects and outputs whatever was yielded by the function.
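A stripped-down sketch of that pattern (the processing function and the paths below are placeholders):

```
# Stripped-down sketch; the processing function and paths are placeholders.
import mwxml

def process_dump(dump, path):
    """Called once per dump file; yield whatever outputs you want collected."""
    for page in dump:
        n_revisions = sum(1 for _ in page)
        yield page.id, page.title, n_revisions

paths = ["dumps/cawiki-history1.xml.bz2", "dumps/cawiki-history2.xml.bz2"]

# mwxml.map handles streaming decompression and parallelizes across files,
# returning a generator over everything the workers yielded.
for page_id, title, n_revisions in mwxml.map(process_dump, paths):
    print(page_id, title, n_revisions)
```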

Hi,

Thanks for the pointers; however, let me clarify the background of our request a bit:

  1. we are already working with mwxml and we are already processing the dumps in a streaming fashion (also, let me point out that all the code we are writing is public in this GitHub organization);
  2. the main problem in terms of storage is the WikiConv dataset, which is quite large, as it is a collection of all discussions from a given Wikipedia. We can try to find another solution for at least preprocessing this dataset so that we are able to do everything with less data;
  3. at this moment, it is quite urgent for us to be able to deploy the servers and test our code, so we would like to find a reasonable compromise to be able to deploy our system.

I would like to update our request to:

  • 600GB of storage
  • 48GB of RAM
  • 20 cores

As mentioned above, we are open to re-evaluating our needs in the short term.

Hello,

Sorry to insist, but it has been a little more than a month since our first request, and we need an infrastructure to process, store, and visualize this data for our research. Cristian's estimates are the minimum needed to get started on the Wikimedia servers.

For some data, we may need more space in the future (and perhaps the RAM and cores can be scaled down).

We would be happy to set up a meeting on September 1st to re-evaluate and discuss the server characteristics, but right now it is really important that we can set the servers up, run a "full cycle" of data processing, and see how our scripts perform on the server.

Thank you.

Mentioned in SAL (#wikimedia-cloud) [2021-07-14T20:22:28Z] <balloons> set quota to 600gb, 20 cores, 48g RAM T284687

I have granted the revised quota request mentioned here (https://phabricator.wikimedia.org/T284687#7198040). The project quota has been set to 600 GB of storage, 20 cores, and 48 GB of RAM.

WMCS stewards a limited set of shared resources and caters to a large audience of community members with varying needs and capabilities. It's important that we are mindful and careful to ensure equitable and fair use of these resources towards the greater good of the movement. We personally would love to say yes to everyone and every project! We're thrilled when people want to utilize our platforms! However, given our limited resources we cannot always meet the needs of every project or request.

Our response to your request (https://phabricator.wikimedia.org/T284687#7173011) was that WMCS would not be able to meet your needs, and we encouraged you to find alternatives. Specifically, the amount of storage requested would be hard to fulfill. I want to be clear that large requests, especially for storage, are difficult to accommodate and must be weighed against other community projects' needs and fair utilization of resources. This is especially true when such requests have a short deadline or are not part of our capacity planning.

Congratulations on your work being selected as a project grant by WMF! Reading your proposal, I'm not sure what your plans are for your infrastructure needs. Given your comments above about further needs, I want to make you aware that while WMCS is a resource available to you (just like anyone who is a part of the movement), we do not know what specific needs you will have as part of this project, nor have we planned to meet those unknown needs. In short, to be clear, please do not expect that every request can be or will be fulfilled.

Understanding the above, I'd be happy to discuss other needs you may have and whether WMCS can help. Engaging over a longer timeframe and letting us understand how we can support your work increases our ability to help, if we're able to do so. Best of luck in your work.

Thank you very much for granting us these resources and for your kind words, @nskaggs. We are totally aware of the circumstances, and we hope to get more clarity on the final resources that will be necessary for the project to run and be sustainable. We will get back to you with a longer message about our future/long-term plans so that we can discuss them.
Best,
Marc

Hi @nskaggs. Sorry to ping you again. I was discussing the set-up with the rest of the team, and we'd like to know if you could give us a hand in re-arranging the server capacities. We think it would be better to split them into two servers: a front-end and a back-end.

  • Frontend
    • disk: 40GB
    • RAM: 8GB
    • cores: 4
  • Backend
    • disk: 560GB
    • RAM: 40GB
    • cores: 16

We've tried to do it ourselves, but apparently we cannot take backups to move our data. It is not that we have much data at the moment (4GB), since we haven't really started using the server, but we have prepared some configurations that we would like to keep.

Do you think it would be possible to move a snapshot of the current server to the new front-end server and resize the current server to be the back-end?

My sysadmin experience is quite limited, so perhaps there is a different way to do it. But in essence, we'd like not to lose the data and configuration of the current server and to reuse them on the front-end.

Please, let us know if you could do this for us. Again, thank you very much!

We don't currently support snapshotting/duplicating instances. You should be able to resize an existing instance using the Horizon UI (to make it bigger), but if you want another host with the same setup, the easiest path is to create a new one and duplicate your previous setup.

I assume that your data is already stored on a cinder volume; if that's correct then you can detach/reattach that volume to whatever server ultimately needs access to that data.

Hope that helps!

nskaggs claimed this task.

I'm going to go ahead and close this as resolved, since further quota changes would need to happen in a new request anyway. It seems the initial creation of the project is complete.