| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| Port mjolnir dag from airflow 1 | repos/data-engineering/airflow-dags!235 | ebernhardson | work/ebernhardson/mjolnir | main |
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Gehel | T322905 [EPIC] Upgrade Search Platform spark jobs to spark 3 |
| Resolved | | Gehel | T318414 [Tracking] Migrate Search Airflow jobs to Airflow 2 and use shared supporting code from the data engineering Airflow |
| Resolved | | EBernhardson | T329239 migrate mjolnir application and dag to airflow v2 and spark3 |
Event Timeline
The mjolnir repo has been migrated to gitlab
The dag has been migrated, currently test-running it in an airflow dev instance on stat1005 before making a merge request.
Next up is to work out the deployment process for search-loader daemons with the new conda based environment.
Test run has completed. It looked reasonable, but I needed to make a few adjustments to get things running. The training looks reasonable. Numbers below are the ndcg@10 value after training. For the test run I only ran against arwiki and plwiki, since smaller wikis would finish in a reasonable time frame. Note that while these runs train against similar data, it's not exactly the same data: spark2 collected training data on Feb 11th while spark3 used data from the 8th. Each run uses the prior 90 days of clicks.
To ensure the new training run was reasonable, I looked at the number of observations we collected and the final trained ndcg@10. While the ndcg@10 is slightly higher in the spark3 versions, I doubt that is meaningful. We are using the same xgboost version and the same feature selection; this is likely normal variance, but I didn't check too closely. Overall these numbers are pretty close and suggest nothing broke in the conversion.
| wiki | version | num_obs | trained ndcg@10 |
|---|---|---|---|
| arwiki | 20230202 | 340234 | 0.821909 |
| arwiki | spark3 | 343651 | 0.854047 |
| plwiki | 20230202 | 463386 | 0.934262 |
| plwiki | spark3 | 461236 | 0.941541 |
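For context on the metric in the table above, here is a minimal sketch of one standard NDCG@10 formulation (linear gain, log2 discount). This is illustrative only; mjolnir/xgboost's exact implementation may differ in gain function and tie handling.

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a single query, given graded relevance labels
    in the order the model ranked them (standard formulation;
    not necessarily mjolnir's exact variant)."""
    def dcg(rels):
        # log2(i + 2) because ranks are 1-based inside the log.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0]))  # perfect ranking -> 1.0
print(ndcg_at_k([0, 2, 1, 3]))  # misranked -> below 1.0
```

Values near 1.0 mean the ranking is close to ideal, which is why the ~0.82-0.94 numbers above being stable across spark2/spark3 suggests the port is behaving.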
I also spot-checked the selected features to ensure they were similar. Out of 50 features, arwiki selected 47 that matched the most recent spark2 run. plwiki selected 38 out of 50, so I looked a little closer to make sure it was reasonable. The following table shows the number of features in each pairwise set intersection. This suggests to me that the variance here is normal and we are still in a normal space.
| | 20230126 | 20230202 | spark3 |
|---|---|---|---|
| 20230111 | 42 | 46 | 41 |
| 20230126 | - | 39 | 42 |
| 20230202 | - | - | 38 |
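The matrix above is just pairwise intersection sizes between the feature sets selected by each run. A quick sketch of how such a comparison can be computed (the run names match the table, but the feature sets here are tiny made-up placeholders, not the real 50-feature selections):

```python
# Hypothetical feature sets; the real runs each select 50 features.
runs = {
    "20230111": {"a", "b", "c", "d"},
    "20230126": {"a", "b", "c", "e"},
    "spark3":   {"a", "b", "d", "e"},
}

names = list(runs)
for i, left in enumerate(names):
    for right in names[i + 1:]:
        # Size of the set intersection = features both runs selected.
        print(f"{left} & {right}: {len(runs[left] & runs[right])}")
```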
Next step: I'm going to also run the training against enwiki to make sure we aren't seeing significantly different resource usage (= executors dying) on our largest training set. I'm also putting some finishing touches on the airflow-dags patch and will have that up for review in the next day or two.
Forgot to mention earlier: the libraries mjolnir uses for feature selection look to be abandoned; the last updates to the github repos were in 2017. The updates necessary to move those libraries from spark 2 to 3 were very minor, mostly pom updates. Given the minimal complexity, I imported the repos into gitlab and updated them for spark 3. New releases were then published to our archiva. These are now found at:
Hello all. Am I right in my thinking that:
- when this task is finished, you won't have any further need for python 3.7 on the Hadoop cluster
- until then, you do need it, and you can't upgrade mjolnir to python 3.9
I'm just working around the various issues I'm discovering while upgrading the test Hadoop cluster to Bullseye (T329363), and this python 3.7 issue jumped out at me. Thanks.
That all sounds correct to me. The updated environment still uses 3.7; this was done to limit the number of changes being made all at once. But my understanding is that once this switches to the conda environment, it will ship its own python executables and will not need python 3.7 installed by the system. If this creates a circular dependency of some sort with setting up the new airflow instance for search-platform, we can stop the existing mjolnir dag and wait for the new one to be deployed with conda. The update is reasonably well tested by now and is waiting on code review and an airflow 2 instance to exist. Missing a week or two of mjolnir runs won't hurt anything.
ebernhardson merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/235
Port mjolnir dag from airflow 1