As a result of T315207: [SPIKE] Investigate bad section parsing (see the Next steps section), we decided to pre-compute all badly parsed pages for ptwiki, where an estimated 69% of pages are affected.
Tasks
- Create a maintenance script that identifies all ptwiki revision IDs for which mwparserfromhell and the Action API disagree on section parsing. The script will run independently of the data pipeline
- Add an optional CLI argument to the data pipeline that takes an HDFS Parquet file of (wiki_db, revision_id) rows and filters those revisions out after the initial query that loads the wikitext dataframe
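For the first task, a minimal sketch of the per-revision check the maintenance script could run. `sections_disagree` is a hypothetical helper: it assumes heading titles have already been extracted locally with mwparserfromhell and that the Action API response is the `parse.sections` array (whose entries carry the heading text under the `line` key); the actual disagreement criterion would follow whatever the spike identified.

```python
def sections_disagree(parser_headings, api_sections):
    """Return True when locally parsed headings differ from the Action API's.

    parser_headings: heading titles extracted with mwparserfromhell
                     (e.g. from wikicode.filter_headings()).
    api_sections:    the 'parse.sections' array from the Action API, where
                     each entry holds the heading text under 'line'.
    """
    api_headings = [section["line"] for section in api_sections]
    return parser_headings != api_headings


# Hypothetical example: the API reports a section the local parse missed,
# so this revision would be flagged for the pre-computed bad-pages list.
local = ["History", "References"]
api = [{"line": "History"}, {"line": "Etymology"}, {"line": "References"}]
print(sections_disagree(local, api))  # True
```

The script would loop this check over ptwiki revisions and write out the (wiki_db, revision_id) pairs that disagree.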
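For the second task, the filter is effectively an anti-join of the wikitext dataframe against the bad-revision list. In the real pipeline this would be a Spark `left_anti` join against the Parquet file; the sketch below uses pandas on toy in-memory data purely to illustrate the semantics, and all column and variable names are assumptions.

```python
import pandas as pd

# Toy stand-in for the dataframe produced by the initial wikitext query.
wikitext = pd.DataFrame({
    "wiki_db": ["ptwiki", "ptwiki", "enwiki"],
    "revision_id": [100, 200, 300],
    "wikitext": ["a", "b", "c"],
})

# Toy stand-in for the (wiki_db, revision_id) rows loaded from the
# Parquet file passed via the new CLI argument.
bad_revisions = pd.DataFrame({"wiki_db": ["ptwiki"], "revision_id": [200]})

# Anti-join: keep only rows with no match in bad_revisions.
merged = wikitext.merge(bad_revisions, on=["wiki_db", "revision_id"],
                        how="left", indicator=True)
filtered = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(sorted(filtered["revision_id"]))  # [100, 300]
```

The PySpark equivalent would be a one-liner along the lines of `wikitext_df.join(bad_df, on=["wiki_db", "revision_id"], how="left_anti")`, applied immediately after the initial query so downstream stages never see the flagged revisions.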