Maniphest T262044

install mwparserfromhell on spark for efficient usage of wikitext-dump in hive
Closed, Resolved · Public

Assigned To

None

Authored By

	MGerlach
	Sep 4 2020, 1:53 PM

Description

I would like to work with the wikitext-dump in hive. For efficient parsing of the wikitext (e.g. extracting plain text), I have previously used the packages mwparserfromhell/wikitextprocessor when working with the xml-dumps directly.

In order to re-use the existing pipelines with spark, would it be possible to install (one of) the packages to all the workers?

mwparserfromhell https://pypi.org/project/mwparserfromhell/
(if possible) wikitextprocessor https://pypi.org/project/wikitextparser/

From the related task T249078 it seems that a more general solution is planned, but my understanding is that it is not yet ready. After reaching out to @JAllemandou and @Ottomata yesterday, I was told to file a ticket in case I need a specific package.

Background: this is part of the ongoing add-a-link project (T253279), in which we try to improve the algorithm for link recommendation.

Details

	Subject	Repo	Branch	Lines +/-
	Install anaconda-wmf on hadoop workerrs and clients	operations/puppet	production	+6 -1


Related Objects

Mentioned Here: T249078: Desired packages to be installed/upgraded on the PySpark cluster (jupyterhub)
T253279: Add a link: algorithm improvements

Event Timeline

MGerlach created this task. Sep 4 2020, 1:53 PM

Restricted Application added a subscriber: Aklapper. Sep 4 2020, 1:53 PM

JFTR, it's packaged in Debian as well: https://packages.qa.debian.org/m/mwparserfromhell.html

Oh cool!

I'd like to try our anaconda-wmf approach for this if we can, as it will be the same approach we use for other packages like this. @elukey would you be ok if we installed anaconda-wmf on all Hadoop workers?

We should also check to make sure that the anaconda-wmf buster package with mwparserfromhell will work on the stretch workers. If not, we should build a stretch specific anaconda-wmf first.
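One way to sanity-check this from a notebook is to try the import on the executors themselves, since mwparserfromhell ships a compiled tokenizer that may not load on a different Debian release. The following is only a sketch: `can_import` and `check_on_executors` are hypothetical helper names, `sc` is assumed to be an existing SparkContext, and nothing guarantees that the tasks land on every worker, so it does not replace testing the package build itself.

```python
import importlib


def can_import(name):
    """Return True if the module `name` can actually be imported in this process."""
    try:
        importlib.import_module(name)
        return True
    except Exception:
        # Missing package, or a compiled extension that fails to load.
        return False


def check_on_executors(sc, name, num_partitions=200):
    """Try the import inside Spark tasks and return a set of (hostname, ok) pairs.

    `sc` is an existing SparkContext; `num_partitions` is an arbitrary choice
    meant to spread tasks over many executors.
    """
    def probe(_):
        import socket  # imported inside the task so it resolves on the executor
        return (socket.gethostname(), can_import(name))

    return set(sc.parallelize(range(num_partitions), num_partitions).map(probe).collect())
```

For example, `check_on_executors(sc, "mwparserfromhell")` would list the hosts that ran a task together with whether the import succeeded there.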

@Ottomata that would be great. Do you have an indication of when you are going to try this out? I am trying to anticipate whether I should wait or try workarounds using a different approach (would much prefer the first option : )

Ottomata edited projects, added Analytics; removed Analytics-Clusters. Sep 10 2020, 4:11 PM

Change 626448 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Install anaconda-wmf on hadoop workerrs and clients

https://gerrit.wikimedia.org/r/626448

gerritbot added a project: Patch-For-Review. Sep 10 2020, 5:05 PM

Change 626448 merged by Ottomata:
[operations/puppet@production] Install anaconda-wmf on hadoop workerrs and clients

https://gerrit.wikimedia.org/r/626448

Maintenance_bot removed a project: Patch-For-Review. Sep 10 2020, 6:10 PM

@Ottomata thanks, this works. I tried jupyterhub with the anaconda base-env on stat1008 and was able to use mwparserfromhell with spark to parse the wikitext-table on hive.
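For reference, a minimal sketch of what such a usage might look like, not the exact code that was run. The helper names, the default column names, and the table name in the usage note are assumptions; only `mwparserfromhell.parse(...).strip_code()` and the PySpark UDF API are existing library calls.

```python
def wikitext_to_plain(wikitext):
    """Return a plain-text rendering of a wikitext string ('' for missing input)."""
    if not wikitext:
        return ""
    # Imported lazily so the import is resolved on the executor running the UDF.
    import mwparserfromhell
    return mwparserfromhell.parse(wikitext).strip_code().strip()


def add_plain_text(df, wikitext_col="revision_text", out_col="plain_text"):
    """Add a plain-text column to a Spark DataFrame.

    The default column names are assumptions about the wikitext table's schema.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    to_plain = F.udf(wikitext_to_plain, StringType())
    return df.withColumn(out_col, to_plain(F.col(wikitext_col)))
```

With a Spark session in the anaconda base environment, this could be applied as `add_plain_text(spark.table("wmf.mediawiki_wikitext_current"))`, where the table name is assumed rather than taken from this task.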

That's great stuff!!!!!

Nuria closed this task as Resolved. Sep 17 2020, 4:16 PM
