Toolforge node for constraint reports updating bot
Closed, Declined · Public

Description

My bot updates reports at https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations

I want to move the bot to Toolforge because my host does not have enough RAM for the growing Wikidata database. The bot now uses ~64 GB of virtual memory, but my host has only 32 GB of RAM, so the bot's performance has become very poor.

The bot is written in C++. It loads the latest Wikidata dump into memory and uses this data to generate the reports. I have already made multiple memory-usage optimizations.

Discussion start:
https://toolsadmin.wikimedia.org/tools/membership/status/262

Event Timeline

You should start by publishing your source code. I have asked for this multiple times before at https://www.wikidata.org/wiki/User_talk:Ivan_A._Krestinin#Publish_the_source_of_KRBot_please . That way people can help to improve your code etc. According to the Wikimedia Cloud/Toolforge rules, you're not allowed to run non-open-source code anyway (https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use).

Toolforge does not have "nodes" for particular tools. It is a shared grid environment where jobs are submitted either to a Grid Engine worker farm or to a Kubernetes cluster. Neither of these environments is going to be a great fit for a process that wants 64G of RAM for long periods of time.

At https://toolsadmin.wikimedia.org/tools/membership/status/262 I asked you if this was a continuous need or a peak processing burst, and you said "Bot's algorithm loads Wikidata dump into memory and uses this data during several hours to generate all needed reports. So 64 GB is used continuously for several hours every day." This is sounding more like something that either needs to be optimized to use less memory (4G or so is 'normal' for a tool) or you need to file a Cloud-VPS (Project-requests) ticket to ask for dedicated resources for your tool.

It would be great to see you work with @Multichill a bit before rushing to try and get a dedicated server.

I’m also curious why the bot needs that much RAM, and would very much like to see the source code :) I fully expect that the bot needs a long time to run, but right now I can’t think of anything that would require holding a large number of entities in RAM at the same time.

The bot parses the full Wikidata dump (844 GB of XML files) and loads items and their properties into memory. This in-memory data is used to generate the reports. Wikidata now has ~47,000,000 items, so my code uses ~1,400 bytes per item (64 GB spread over 47,000,000 items). I cannot load only part of the data, because parsing the dumps is a long, sequential process (~5 hours using 4 threads).
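The relationship between item count, per-item size, and total memory can be checked with a short calculation. This is only a sketch of the arithmetic: the function names are hypothetical and are not taken from the bot's source, and "64 GB" is assumed to mean 64 × 10^9 bytes.

```cpp
#include <cstdint>

// Hypothetical helper: total bytes needed to hold `items` records
// at an average of `bytes_per_item` bytes each.
constexpr std::uint64_t footprint_bytes(std::uint64_t items,
                                        std::uint64_t bytes_per_item) {
    return items * bytes_per_item;
}

// Hypothetical helper: average bytes available per item when
// `total_bytes` is spread over `items` records (items must be non-zero).
constexpr std::uint64_t bytes_per_item(std::uint64_t total_bytes,
                                       std::uint64_t items) {
    return total_bytes / items;
}
```

For 47,000,000 items, every 100 bytes per item costs about 4.7 GB, and a 64 GB working set works out to roughly 1,360 bytes per item.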

Another approach is to restore the dumps into an SQL database or a similar engine and generate the reports by querying it. As far as I know, a bot on Toolforge can access an existing database, so the dump-restoring step could be skipped. But it is hard for me to predict the resulting performance of such an approach. For example, DB engines usually have poor support for regular expressions, so for the Format constraint check I would need to select all values of a property and check them in the bot. For property P2093 that is ~1.5 GB of data. This is a single property, but most properties have constraints. A single bot run will touch 90% of all information stored in the database, so the bot would use the database heavily for a long time every day.
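To make the Format constraint case concrete, here is a minimal sketch of checking values on the bot side, one batch at a time, instead of asking the database to evaluate the regular expression. It is not the bot's actual code: `ValueRow` and `find_format_violations` are hypothetical names, how the batches are fetched from the database is left out, and `std::regex` uses ECMAScript syntax by default, which may not match the regex dialect used in the constraint definitions.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical row type: one property value together with its item.
struct ValueRow {
    std::string item_id;  // e.g. "Q42"
    std::string value;    // the property value to validate
};

// Returns the rows whose value does not fully match `pattern`.
// Intended to be called once per batch of rows, so memory use is
// bounded by the batch size rather than by the whole property.
std::vector<ValueRow> find_format_violations(const std::vector<ValueRow>& batch,
                                             const std::regex& pattern) {
    std::vector<ValueRow> violations;
    for (const ValueRow& row : batch) {
        // regex_match requires the entire value to match the pattern.
        if (!std::regex_match(row.value, pattern)) {
            violations.push_back(row);
        }
    }
    return violations;
}
```

With this shape the database only has to stream the values of a property; the ~1.5 GB for P2093 would still cross the network, but it would never need to be held in memory all at once.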

I’m going to mark this as Stalled, since I don’t think there’s much that can be done here until the bot’s source code is published – that’s a necessary requirement for running it on any Cloud Services infrastructure, whether on Toolforge or on Cloud VPS.

Declining per my prior response in T189747#4051897. A Cloud VPS project could be requested to provide dedicated resources, but there is no dedicated Toolforge "node" option.

A Cloud VPS project request is also unlikely to be approved without further details about the source code, its license, and the general value to the Wikidata/Wikibase community of this tool.