
install mwparserfromhell on spark for efficient usage of wikitext-dump in hive
Closed, Resolved (Public)

Description

I would like to work with the wikitext-dump in hive. For efficient parsing of the wikitext (e.g. extracting plain text), I have previously used the packages mwparserfromhell/wikitextprocessor when working with the xml-dumps directly.

In order to re-use the existing pipelines with spark, would it be possible to install (one of) the packages to all the workers?

From the related task T249078 it seems there is a more general solution planned but my understanding is that this is not yet ready. After reaching out to @JAllemandou and @Ottomata yesterday, I was told to file a ticket in case I need a specific package.

Background: this is part of the ongoing add-a-link project (T253279), in which we are trying to improve the algorithm for link recommendation.

Event Timeline

Oh cool!

I'd like to try our anaconda-wmf approach for this if we can, as it will be the same approach we use for other packages like this. @elukey would you be ok if we installed anaconda-wmf on all Hadoop workers?

We should also check that the anaconda-wmf buster package with mwparserfromhell will work on the stretch workers. If not, we should build a stretch-specific anaconda-wmf first.

@Ottomata that would be great. Do you have an indication of when you are going to try this out? I am trying to anticipate whether I should wait or try workarounds using a different approach (I would much prefer the first option :)

Change 626448 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Install anaconda-wmf on hadoop workerrs and clients

https://gerrit.wikimedia.org/r/626448

Change 626448 merged by Ottomata:
[operations/puppet@production] Install anaconda-wmf on hadoop workerrs and clients

https://gerrit.wikimedia.org/r/626448

@Ottomata thanks, this works. I tried jupyterhub with the anaconda base-env on stat1008 and was able to use mwparserfromhell with spark to parse the wikitext-table on hive.
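
For anyone landing here later, a minimal sketch of what this usage might look like. It is not the exact code from the add-a-link pipeline. The column names `revision_text` and `plain_text` and the helper names are assumptions made for illustration; only `mwparserfromhell.parse(...)` and `strip_code()` are the library's own calls. The import sits inside the function so that it is resolved on the Spark executors, which is where the package has to be available.

```python
def wikitext_to_plain(wikitext):
    """Return the plain text of a wikitext string, or None for missing input."""
    if wikitext is None:
        return None
    # Imported lazily so the import happens on the executor that runs the UDF.
    import mwparserfromhell
    return mwparserfromhell.parse(wikitext).strip_code()


def add_plain_text(df, text_col="revision_text", out_col="plain_text"):
    """Add a plain-text column to a Spark DataFrame.

    text_col and out_col are hypothetical column names; adjust them to the
    schema of the wikitext table actually being read.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    to_plain = udf(wikitext_to_plain, StringType())
    return df.withColumn(out_col, to_plain(df[text_col]))
```

With a DataFrame `df` read from the wikitext table, `add_plain_text(df)` would return the same rows with one extra string column. A plain Python UDF is the simplest option here; for large dumps a pandas UDF or `mapPartitions` may reduce serialization overhead, but that has not been measured in this task.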