
Upgrade the link recommendation algorithm from Spark 2 to Spark 3.
Closed, Resolved · Public

Description

The link recommendation algorithm currently uses Spark 2 in the model training pipeline.

The Data Engineering team is upgrading to Spark 3 and will no longer support Spark 2 after 31 March 2023.

The goal is to investigate and upgrade all instances where Spark 2 is used in the link recommendation algorithm.

Event Timeline

In my opinion, one of the main issues for the migration from Spark 2 to Spark 3 will be the following (there may be other issues as well):
Currently, the pipeline activates the anaconda-wmf environment for its Spark jobs (code). anaconda-wmf only supports Spark 2, so we have to switch to conda-analytics. One problem I expect is that the Spark jobs need packages such as mwparserfromhell (see code). In the anaconda-wmf environment these are already installed and therefore available on the Spark workers. In contrast, conda-analytics is a minimal environment that does not contain that package by default, so it will not be available on the Spark workers. We therefore need to build a custom environment in which we install those dependencies (most importantly mwparserfromhell), and then ship that environment to the Spark workers when creating the Spark session (for example here). For this we can use the wmfdata package, which provides an easy interface for starting a Spark session and shipping the environment to the workers (code) by setting ship_python_env=True.

@MGerlach, thank you for sharing the documentation. Below are the steps I have followed to migrate the link-recommendation algorithm from spark 2 to 3.

These steps were run in my custom conda environment on the stat server. If you review and approve them, we'll move the conda env to a global space where everyone running the link-recommendation algorithm can use it.

1. Checked that the mwparserfromhell package indeed does not exist in conda-analytics.
source: on the stat machine, run the commands below to list available packages.

$ /usr/lib/anaconda-wmf/bin/conda list -n base | grep ^m
$ /opt/conda-analytics/bin/conda list -n base | grep ^m

2. Created a custom conda environment based on conda-analytics, then installed the mwparserfromhell package.
source:
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/conda-analytics#h-Creating_a_new_conda_user_environment-Usage
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/conda-analytics#h-Installing_packages_into_your_user_conda_environment

$ conda-analytics-clone link-recommendation-env
$ source conda-analytics-activate link-recommendation-env
$ export http_proxy=http://webproxy.eqiad.wmnet:8080
$ export https_proxy=http://webproxy.eqiad.wmnet:8080
$ pip install mwparserfromhell==0.5.4
$ conda deactivate
# confirm installation
$ /home/kevinbazira/.conda/envs/link-recommendation-env/bin/conda list | grep ^m
# confirm wmfdata exists
$ /home/kevinbazira/.conda/envs/link-recommendation-env/bin/conda list | grep ^w

3. Changed the conda base environment from anaconda-wmf to the new conda-analytics clone, replacing /usr/lib/anaconda-wmf/bin/ with /home/kevinbazira/.conda/envs/link-recommendation-env/bin/.
file:
https://github.com/wikimedia/research-mwaddlink/blob/main/run-pipeline.sh#L17-L22
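As a sketch, the change in run-pipeline.sh amounts to swapping the interpreter prefix. The variable name below is hypothetical, not taken from the actual script:

```shell
# before (anaconda-wmf, Spark 2 only):
# CONDA_BIN=/usr/lib/anaconda-wmf/bin
# after (conda-analytics clone with mwparserfromhell installed):
CONDA_BIN=/home/kevinbazira/.conda/envs/link-recommendation-env/bin
```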

4. Changed the Python version from python3.7 to python3.10.
source: on the stat machine, run the commands below to see the Python versions shipped by anaconda-wmf and conda-analytics.

$ ls /usr/lib/anaconda-wmf/bin/ | grep ^python
$ ls /home/kevinbazira/.conda/envs/link-recommendation-env/bin/ | grep ^python

file:
https://github.com/wikimedia/research-mwaddlink/blob/main/run-pipeline.sh#L17-L22

5. Used the wmfdata package to create the Spark session.
files:
https://github.com/martingerlach/linkrec/blob/e212b26e17a9d991c071659d25abf1c1fa91a549/src/scripts/generate_wdproperties_spark.py#L33-L38
https://github.com/martingerlach/linkrec/blob/e212b26e17a9d991c071659d25abf1c1fa91a549/src/scripts/generate_anchor_dictionary_spark.py#L141-L146

from wmfdata.spark import create_session
...
# use wmfdata to create new Spark session
spark = create_session(
    type="yarn-regular",
    app_name="generating-anchors",
    extra_settings={},
    ship_python_env=True,
)

6. Changed the Spark commands from spark2-submit to spark3-submit.
source:
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#h-Command-line_interfaces
file:
https://github.com/wikimedia/research-mwaddlink/blob/main/run-pipeline.sh#L17-L22
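Sketched as a before/after on the submit command (the flags and script name are illustrative, not copied from the actual pipeline):

```shell
# before:
# spark2-submit --master yarn src/scripts/generate_anchor_dictionary_spark.py
# after:
# spark3-submit --master yarn src/scripts/generate_anchor_dictionary_spark.py
```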

7. Tested the training pipeline using WIKI_ID=xhwiki ./run-pipeline.sh.

8. The pipeline ran into a ValueError: unsupported pickle protocol: 5 issue, as shown in the screenshot below:

Unsupported pickle protocol error - Screenshot from 2023-04-03 07-23-22.png (741×1 px, 283 KB)

I'm not sure whether you would like us to dump the pickle files with a lower protocol (e.g. 4) or load them with protocol 5 instead. @MGerlach, please let me know in case there is anything else we might need to change.
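For context, Python 3.10 pickles with protocol 5 by default, while Python 3.7's pickle module only reads up to protocol 4; the error appears when a 3.10 writer meets a 3.7 reader. A minimal sketch of the "lower protocol" option, assuming the pipeline currently dumps with the default protocol (the payload below is a placeholder, not real pipeline data):

```python
import pickle

# Placeholder payload standing in for a pipeline artifact such as an
# anchor dictionary; the real objects are larger but pickle the same way.
data = {"anchors": ["example"], "scores": [0.9]}

# Pinning protocol=4 keeps the dump readable by Python 3.4+, so both the
# python3.10 writer and any remaining python3.7 readers can load it.
blob = pickle.dumps(data, protocol=4)
```

Every pickle stream from protocol 2 onward starts with the PROTO opcode (0x80) followed by the protocol number, so the chosen protocol can be verified directly from the first two bytes of the dump.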

Otherwise, once the training pipeline succeeds, the next step will be to move the link-recommendation-env conda env from my user environment to a global space where everyone running the link-recommendation algorithm can use it.

The training pipeline succeeded after fixing code blocks (#1 and #2) that were spawning a different virtual env.

Training pipeline succeeded in new conda env - Screenshot from 2023-04-03 21-44-50.png (741×1 px, 140 KB)

Also updated the requirements.txt packages to new versions:

line-profiler==4.0.3
marisa-trie==0.8.0
pandas==1.4.3
scikit-learn==1.2.2
wikipedia2vec==1.0.5
pyicu==2.10.2

@MGerlach, thank you for all the pointers you shared. Please let me know in case there is anything else we might need to change before moving the link-recommendation-env conda env from my user environment to a global space where everyone running the link-recommendation algorithm can use it.

Change 905552 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[research/mwaddlink@main] [WIP] Migrate from Spark 2 to 3

https://gerrit.wikimedia.org/r/905552

Change 905552 merged by jenkins-bot:

[research/mwaddlink@main] Migrate from Spark 2 to 3

https://gerrit.wikimedia.org/r/905552

Hello @kevinbazira. Data Engineering is tracking the remaining Spark 2 migrations and wanted to know the status of this one. From the conversation and Gerrit activity above, it looks like this task is done?