I would like to work with the wikitext dump in Hive. To parse the wikitext efficiently (e.g. to extract plain text), I have previously used the packages mwparserfromhell/wikitextprocessor when working with the XML dumps directly.
In order to re-use the existing pipelines with Spark, would it be possible to install (one of) the packages on all the workers?
- mwparserfromhell https://pypi.org/project/mwparserfromhell/
- (if possible) wikitextprocessor https://pypi.org/project/wikitextparser/
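For illustration, this is roughly how I would use the package from PySpark once it is available on the workers. It is only a sketch: the helper names (`wikitext_to_plain`, `add_plain_text`) and the column name `revision_text` are my own assumptions, not an existing pipeline.

```python
def wikitext_to_plain(wikitext):
    """Return the plain text of a wikitext string, or None for null input."""
    if wikitext is None:
        return None
    # Imported inside the function so that the import runs on the Spark
    # worker, which is where the package would need to be installed.
    import mwparserfromhell
    return mwparserfromhell.parse(wikitext).strip_code()


def add_plain_text(df, wikitext_col="revision_text"):
    """Add a plain_text column to a Spark DataFrame.

    The default column name is an assumption about the Hive table schema.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    to_plain = udf(wikitext_to_plain, StringType())
    return df.withColumn("plain_text", to_plain(df[wikitext_col]))
```

The UDF fails with an ImportError on any worker where mwparserfromhell is missing, which is why the package would need to be installed on all of them.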
From the related task T249078 it seems that a more general solution is planned, but my understanding is that it is not ready yet. After reaching out to @JAllemandou and @Ottomata yesterday, I was told to file a ticket in case I need a specific package.
Background: this is part of the ongoing add-a-link project (T253279), in which we are trying to improve the algorithm for link recommendation.