
Wikidata Concepts Monitor ETL Migration to Spark3
Closed, Resolved · Public

Description

Documentation: Wikidata Concepts Monitor ETL job.

The 'Wikidata Concepts Monitor ETL' job was created by Goran Milovanovic a few years ago and produces results that were originally visible here (the site is currently broken).

The job is scheduled to run regularly on our Hadoop cluster and uses Spark with R.

We are currently migrating all our jobs from spark version 2 to spark version 3, and this job still uses spark version 2.

In this task we should:

  • Figure out the components which need to be migrated to Spark 3
  • Do the migration

Event Timeline

Is @GoranSMilovanovic available to help steward the migration and test that the output is as expected?

Is there any chance that we could get the https://wdcm.wmflabs.org/ site working again before we do the migration, so that we may more easily test results? Or is there not really any value in that for us?

@BTullis With all the good will I have to help with this, there's no chance I'll be able to help before June 2023.

@Manuel can give a more up to date rundown of our plans for all this. I'll be working on the migration with him :)

Thanks both for the input. Let me know if I can help at all.

Hi @BTullis, thank you for your offer, we might take you up on that!

@AndrewTavis_WMDE is our newly hired Data Analyst for Wikidata. The plan is that he will mainly work on this with support from @xcollazo (WMF), @ItamarWMDE (Staff Engineer for Wikidata), and me (Analytics Product Manager for Wikidata). We plan to first evaluate the situation of the WDCM in June. Ideally, we would start the migration only based on that evaluation.

Would that fit your plans, or is there already a risk of losing data by then?

@GoranSMilovanovic: I am very aware that you are only working on this as a volunteer. So, no worries, we will try to solve this as well as we can, and only ask you about stuff where we are lost otherwise. And thank you for staying involved!

@Manuel It is not that I am very much involved; my professional situation is simply what it is: I can barely find any time beyond the responsibilities I already carry. However, I will make a real effort at the end of May to clear some space for our work in June. Objective constraints are all I can offer as an excuse.

No excuse is needed whatsoever! Wikimedia is now responsible for the WDCM, and we will deal with this. There is no question about it. And I am grateful that we can still ask you questions in case we get lost. 🙏

> @AndrewTavis_WMDE is our newly hired Data Analyst for Wikidata. The plan is that he will mainly work on this with support from @xcollazo (WMF), @ItamarWMDE (Staff Engineer for Wikidata), and me (Analytics Product Manager for Wikidata). We plan to first evaluate the situation of the WDCM in June. Ideally, we would start the migration only based on that evaluation.
>
> Would that fit your plans, or is there already a risk of losing data by then?

Great. I don't believe that there is any risk of losing data by then.
Spark2 will no longer be available when the Hadoop cluster is upgraded to Debian bullseye, but that's a little way off yet.

It might also be relevant to look at those parts of the job that currently run on the stats servers. By the looks of it, that's stat1004 and stat1007.
Both of these run Debian buster. We're just starting to bring in bullseye-based stats servers (e.g. T336036: Bring stat1009 into service and T336040: Bring stat1010 into service with GPU from stat1005), at which point we will start decommissioning some buster-based stats servers and upgrading those that remain.

I'm not sure what plans @xcollazo has for the migration either, so it could be that these parts of the process are migrated away from the stats servers to an airflow based pipeline.
Anyway, I'm happy to try to help if I can, but Xabriel almost certainly knows more about the plan than I do.

> I'm not sure what plans @xcollazo has for the migration either, so it could be that these parts of the process are migrated away from the stats servers to an airflow based pipeline.
> Anyway, I'm happy to try to help if I can, but Xabriel almost certainly knows more about the plan than I do.

When I first looked at this, I definitely thought that the code that runs on the stat servers could benefit from moving to Airflow, since we are discouraging folks from doing production work on stat machines. But I hesitated to put it as a target for this task because it is not strictly necessary for moving this codebase to Spark3.

Having said that, I am super happy to help on that front if folks are interested in getting this codebase set up for the long run.

(Switching ownership to reflect @Manuel's comment above. )

Thanks all for your willingness to help! We'll be in touch in June once we've had the initial meetings with @ItamarWMDE. Those are planned for the 13th and 14th :)

Hi Folks - What is the status on this one?

I'd like Data-Engineering to announce the deprecation of Spark2 for the end of this month, but not without knowing how we plan on tackling your job :)
Here are the three possible solutions I can think of:

  • Stop the job while it is revamped for Spark3 (knowing that the dashboard is broken, is that a possible solution?)
  • Configure the job to use fixed resources instead of DynamicAllocation, making it work on Spark2 despite the deprecation, though using more cluster resources than really needed
  • Postpone deprecating Spark2 (if we could avoid that, I'd be super happy :)

Let me know your thoughts :)
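For reference, the fixed-resource option boils down to disabling dynamic allocation and pinning an executor count. A minimal sketch of the relevant spark-submit flags — the entry point and all sizing values below are illustrative assumptions, not the job's actual settings:

```shell
# Sketch of the fixed-resource option: run under Spark2 with a static
# executor footprint. Disabling dynamic allocation removes the dependency
# on the external shuffle service, at the cost of holding all executors
# for the whole run. Sizing values are illustrative assumptions.
spark2-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.shuffle.service.enabled=false \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 8g \
  wdcm_etl_job.R   # hypothetical entry point
```

The trade-off Joseph mentions is visible here: with dynamic allocation off, the 8 executors are held even during idle stages of the job.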

Thanks for reaching out, @JAllemandou!

We're making progress in getting local copies of the dashboards up and running. @Manuel and I will be discussing them on Wednesday and can get back to you all then :)

Hope all had a nice weekend!

I'll async with him now and see if we can come to a decision sooner than that, but you all will have the answer by Wednesday at the latest 😊

> I'll async with him now and see if we can come to a decision sooner than that, but you all will have the answer by Wednesday at the latest 😊

Awesome, thank you :)

Checking in with you all on this:

The big question that @Manuel and I have is whether this process is currently generating time-series data that would be of particular value and that would be lost with no way to recreate it from the data lake. I have been able to get the pageviews-per-namespace time-series dashboard up and running locally, but it has not updated since February. That and the other data produced by this process (which can be found here) seem to be just current values rather than time series (except for propertypairs, which we're investigating further). Do you all know of any time-series tables within the data lake that are being updated by this process and that would be broken if we stopped the job?

A broader question: does anyone at WMF know of tables in the data lake that Wikimedia Deutschland created/maintained in the past that are still being updated? Would be great to be pointed towards them so we can investigate further :)
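One quick way to tell whether a table holds an accumulating time series or just an overwritten snapshot is to look at its partitions: a real time series shows many dated partitions, while an overwritten table has one or none. A hedged sketch — the database, table name, and HDFS path below are hypothetical placeholders:

```shell
# List partitions of a candidate table; many dated partitions suggest an
# accumulating time series, while a single (or no) partition suggests the
# table is overwritten on each run. Names here are hypothetical placeholders.
spark-sql -e "SHOW PARTITIONS goransm.wdcm_example_table"

# Alternatively, inspect the HDFS directory layout directly:
hdfs dfs -ls /user/goransm/wdcm   # hypothetical path
```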

Our thoughts on the question at hand:

  • If the answer to the above question of permanently losing some data that's being produced by Concepts Monitor and other WMDE jobs is no, then we're ok with option one above of stopping the job.
  • Aside from this we'd prefer option two of configuring it to use fixed-resource.
> If the answer to the above question of permanently losing some data that's being produced by Concepts Monitor and other WMDE jobs is no, then we're ok with option one above of stopping the job.

Unfortunately I am not knowledgeable about the data generated by the job, which prevents me from assessing whether it produces anything we would not be able to regenerate.
Also, I have not been told about any intermediary data stored on the cluster, which makes me think that all the data generated by the job is small enough to be kept for the reports only.
But as stated before, those are uninformed ideas :(

> Aside from this we'd prefer option two of configuring it to use fixed-resource.

We can test that :)

> We can test that :)

@JAllemandou, not sure what the tests entail, but feel free to look into it and please let us know what the results are 😊 As long as the tests turn out ok and it's not too much of a bother, then we're fine with this and going with option two to use fixed-resource for now :)

Hi @AndrewTavis_WMDE,
I've done some investigation, and here is what I have: Goran has 10 cron jobs running from various hosts on our system (1 on stat1004, 2 on stat1007, 7 on stat1008).

  • WDCM_Sqoop_Clients runs on stat1004 weekly - It doesn't run spark (but Sqoop)
  • 2021_WMDE_Mitmachen_Bereich_2021_Campaign runs on stat1007 daily - It doesn't run spark (but Hive)
  • WD_PageviewsPerType runs on stat1007 daily but has been failing since February 17th - It runs a spark job
  • WD_UsageCoverage runs on stat1008 daily - It runs a spark job
  • WD_languagesLandscape runs on stat1008 monthly (30th of the month) - It runs a spark job
  • Wiktionary_CognateDashboard runs on stat1008 daily - It doesn't run spark
  • WDCM_EngineBiases runs on stat1008 weekly - It runs a spark job
  • Qurator_CuriousFacts runs on stat1008 monthly (10th of the month) - It runs a spark job
  • WMDE_BannerImpressions runs on stat1008 hourly - It doesn't run spark (but Hive)
  • NewEditors_comprehensive_report runs on stat1008 daily - It runs a spark job

We need to meet and talk about your usage of the data generated by those scripts, and see what you wish us to try to make work versus stop.
I'm booking some time on your calendar next Monday :)
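For anyone who needs to repeat this kind of inventory later, the enumeration above can be reproduced by looping over the stat hosts — a rough sketch, where the host list comes from this thread and the crontab owner's username is an assumption:

```shell
# Enumerate one user's cron entries on each stat host.
# The username is an assumption; adjust to the actual crontab owner,
# and note that listing another user's crontab requires sudo rights.
for host in stat1004 stat1007 stat1008; do
  echo "== ${host} =="
  ssh "${host}" 'sudo crontab -l -u goransm' 2>/dev/null || echo "(no access or no crontab)"
done
```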

Hey @JAllemandou!

Thanks for all your efforts to find these jobs! Really appreciate it 😊 It's been a bit difficult to figure out what infrastructure we have where. I'll update T340718 with the information you posted above.

Thanks for booking the time! Accepted the meeting and also invited @Manuel :)

Hope you have a nice weekend!

We met this morning with @AndrewTavis_WMDE and @Manuel - Thank you folks for the great meeting.
The detailed Meeting notes are here: https://docs.google.com/document/d/1REsolXnZf2KqApL0p-DE8X4eWXI_zxHgrCe3k1hcZnw

From the job list in previous comment:

  • 4 don't run spark and are kept as-is: WMDE_BannerImpressions, Wiktionary_CognateDashboard, 2021_WMDE_Mitmachen_Bereich_2021_Campaign, WDCM_Sqoop_Clients
  • 3 are stopped (crontab entries commented out): Qurator_CuriousFacts, WDCM_EngineBiases, WD_PageviewsPerType
  • 3 have been updated to run spark2 in fixed-resource mode, so they should not fail after the migration to the spark3-shuffler: WD_UsageCoverage, WD_languagesLandscape, NewEditors_comprehensive_report

With those changes, there are no more blockers from this task to migrating to the spark3-shuffler :)

> With those changes there is no more blocker in migrating to the spark3-shuffler from this task :)

\o/

Thank you again for your super helpful support on this, @JAllemandou!