Things keep evolving on the Wikipedia offline scraping side.
Over the last 18 months we have massively improved the performance of the MWoffliner scraper, by letting it create offline ZIM files "on the fly" without much filesystem I/O and by seriously optimising CPU/RAM management. There are still a few additional things to do, but most of what is easy to do has been done.
We have also fully automated ZIM generation using a scheduling platform called Zimfarm, available at https://farm.openzim.org. This fully dockerized application gives us an autonomous ZIM building solution which rationally schedules the scrapes to be done. The workload is then distributed across decentralised workers. The 5 mwoffliner VPS boxes are in charge of all the Wikimedia projects (in different ZIM flavours: no pictures, no videos, etc.).
All of this works fine, but we still struggle a bit to regenerate all the Wikimedia ZIM files on time once a month, which is the goal we agreed on with the WMF. The last upgrade of our quota happened 4 years ago (https://phabricator.wikimedia.org/T117095#2807242), but the amount of content to deal with increases constantly. This is why I am coming back to ask for an update of our quota.
We have 5 VPS VMs. mwoffliner4 and mwoffliner5, the ones created in 2016, have the xlarge-xtradisk profile and we are really happy with them. But the 3 other ones (mwoffliner1, mwoffliner2, mwoffliner3) have pretty old/small(er) profiles which are no longer really suited to our usage. My request would be to homogenise the MWoffliner setup by moving these 3 older ones to xlarge-xtradisk as well.
If my calculation is right, this should allow us to achieve our goal of a regular monthly release for all the WMF ZIM files.
Here is the formal request:
Project Name: mwoffliner
Type of quota increase requested (we don't need extra floating ips):
- mwoffliner1: m1.xlarge -> xlarge-xtradisk
- mwoffliner2: m1.xlarge -> xlarge-xtradisk
- mwoffliner3: m1.large -> xlarge-xtradisk
- mwoffliner4: keep the same
- mwoffliner5: keep the same
- wp1: keep the same
Amount of quota increase: In a nutshell, more disk and one additional xlarge-xtradisk
Reason: We need to be quicker (xlarge-xtradisk vCPUs are twice as fast as the others) and to be able to run the largest scrapes in parallel without running short of disk