
Mediawiki_wikitext_history job often has long gaps between stages
Closed, Duplicate · Public

Description

The mediawiki_wikitext_history job (Wikitech page, Airflow DAG, Scala code) frequently has a long gap between the end of the wait_for_pages_meta_history_xml_dump stage and the beginning of the convert_history_xml_to_parquet stage. This gap accounts for virtually all of the volatility in total job duration.

Gap size by run:

  • 2024-01-01: none
  • 2023-12-01: 2 d
  • 2023-11-01: 4 d
  • 2023-10-01: none
  • 2023-09-01: 1 d
  • 2023-08-01: none
  • 2023-07-01: none
  • 2023-06-01: 4 d
  • 2023-05-01: none
  • 2023-04-01: 1 d

As far as I know, this shouldn't happen. Eliminating it would significantly improve the speed of delivery of the movement metrics, which is the focus of the annual plan hypothesis SDS 2.6.2.

Event Timeline

Some research:

  • Each XML dumps snapshot may represent ~5.5 TB compressed (including ~1.8 TB for wikidata and ~1.4 TB for enwiki).
  • The Airflow sensor may take ~19 days to turn green. It waits until the last dump has been processed (_IMPORTED flag). Most dumps are generated in a matter of days (~4 on average, maybe), enwiki may take ~7 days, and everything then waits for the wikidata dump (~19 days). A sketch of how such a sensor could look follows this list.
  • When the sensor turns green, a heavy Spark job is launched to convert all the compressed XML to parquet. The ~5.5 TB (compressed) takes ~4.5 days to process.
  • The perceived gaps are due to the non-parallelism of the DAG combined with very long jobs: one heavy job prevents the other ones from running, made worse by the retries (thanks for the pointer @JAllemandou). Other symptoms of what I think is the same problem: https://phabricator.wikimedia.org/T342911
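
To make the sensor behaviour concrete, here is a minimal sketch (not the actual DAG code) of how a wait-for-_IMPORTED sensor could be expressed in Airflow; the DAG id, flag path, and intervals are assumptions:

```
# Hypothetical sketch only: poll HDFS until the dumps process has written its
# _IMPORTED flag for the snapshot. Paths, dag_id, and intervals are assumptions.
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.sensors.python import PythonSensor

# Assumed flag location; the real layout of the dumps directories may differ.
IMPORTED_FLAG = (
    "/wmf/data/raw/mediawiki/dumps/pages_meta_history/{{ ds_nodash }}/_IMPORTED"
)

def flag_exists(path: str) -> bool:
    # `hdfs dfs -test -e` exits 0 when the path exists.
    return subprocess.run(["hdfs", "dfs", "-test", "-e", path]).returncode == 0

with DAG(
    dag_id="mediawiki_wikitext_history_sketch",
    start_date=datetime(2023, 4, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    wait_for_dump = PythonSensor(
        task_id="wait_for_pages_meta_history_xml_dump",
        python_callable=flag_exists,
        op_args=[IMPORTED_FLAG],
        poke_interval=60 * 60,          # check hourly; the snapshot can take ~19 days
        timeout=60 * 60 * 24 * 25,      # give up after 25 days
        mode="reschedule",              # free the worker slot between pokes
    )
```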

Regarding load distribution, the pipeline could be improved if we accept some tradeoffs, depending on how it's used downstream. For example, we could generate one sensor per dump and then one Spark job per dump (a sketch of this layout follows below). That would smooth the load on the cluster for the ~99% of small wikis, and downstream jobs could be triggered as soon as enwiki is finished, probably at 7 + 4.5 × 1.4/5.5 ≈ 8.1 days. Wikidata would then be processed later without blocking the chain.
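
A minimal sketch of that per-dump layout, assuming the conversion step can be launched per wiki with a SparkSubmitOperator; the wiki list, paths, jar, and class name are placeholders, not the real job configuration:

```
# Hypothetical per-dump fan-out: one sensor plus one conversion job per wiki,
# so small wikis finish early and nothing waits on wikidata. All names below
# (wiki list, paths, jar, class) are placeholders.
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.sensors.python import PythonSensor
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

WIKIS = ["enwiki", "dewiki", "frwiki", "wikidatawiki"]  # in reality, the full dump list

def flag_exists(path: str) -> bool:
    # `hdfs dfs -test -e` exits 0 when the path exists.
    return subprocess.run(["hdfs", "dfs", "-test", "-e", path]).returncode == 0

with DAG(
    dag_id="mediawiki_wikitext_history_per_wiki_sketch",
    start_date=datetime(2023, 4, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    for wiki in WIKIS:
        wait = PythonSensor(
            task_id=f"wait_for_{wiki}_dump",
            python_callable=flag_exists,
            op_args=[
                f"/wmf/data/raw/mediawiki/dumps/pages_meta_history/{wiki}"
                "/{{ ds_nodash }}/_IMPORTED"
            ],
            poke_interval=60 * 60,
            mode="reschedule",
        )
        convert = SparkSubmitOperator(
            task_id=f"convert_{wiki}_history_xml_to_parquet",
            application="hdfs:///path/to/refinery-job.jar",      # placeholder artifact
            java_class="org.example.XmlDumpToParquet",            # placeholder class
            application_args=[wiki, "{{ ds_nodash }}"],
            conf={"spark.dynamicAllocation.maxExecutors": "64"},  # cap per-wiki load
        )
        wait >> convert
```

With this shape, downstream triggers could also be defined per wiki (or on "all wikis except wikidata"), which is where the ~8.1-day estimate above comes from.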

Thank you, @Antoine_Quhen!

At a meeting yesterday, we noted the following:

  • We want the job not to retry the conversion stage, because the retries have generated duplicate data (T342911). Future runs might then require manual retries, but it's also possible that Airflow was incorrectly detecting stage failures, in which case this change will let the first attempt succeed. (A sketch of the configuration change follows this list.)
  • The job could potentially be further sped up by parallelizing it with per-wiki jobs or stages. However, this might actually increase the risk of problems, since the job is I/O intensive and trying to do two large wikis at once could put too much pressure on HDFS. This might be alleviated if we set up per-wiki sensors so the differing arrival times of the dumps would naturally space out the conversion jobs. In any event, this would be significantly more work than T357859 with less potential speed improvement.
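
For reference, a minimal sketch of the retry change, assuming the conversion step is a SparkSubmitOperator task inside the existing DAG definition; everything except the task id is a placeholder:

```
# Hypothetical sketch: disable automatic retries on the conversion task so a
# perceived failure never relaunches the Spark job over partially written
# output. The jar path and class are placeholders; this sits inside the DAG.
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

convert = SparkSubmitOperator(
    task_id="convert_history_xml_to_parquet",
    application="hdfs:///path/to/refinery-job.jar",  # placeholder artifact
    java_class="org.example.XmlDumpToParquet",       # placeholder class
    retries=0,               # never retry automatically; rerun manually if needed
    email_on_failure=True,   # surface the failure instead of silently retrying
)
```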

So, for now, it doesn't make sense to pursue any improvements to the mediawiki_wikitext_history job other than fixing the retry/apparent stage failure issue.

I will merge this into T342911 since that's the cause of the gaps I noticed.