Pre-compute bad section parsing for ptwiki
Closed, Resolved, Public

Description

As a result of T315207: [SPIKE] Investigate bad section parsing (see its Next steps section), we decided to pre-compute all badly parsed pages for ptwiki, which accounts for an estimated 69% of them.

Tasks

  • Create a maintenance script that identifies all ptwiki page revision IDs for which mwparserfromhell and the Action API disagree. The script will run independently of the data pipeline.
  • Add an optional CLI argument to the data pipeline that takes an HDFS parquet of (wiki_db, revision ID) rows and filters those revisions out right after the initial query that loads the wikitext dataframe.
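
The two tasks above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was merged: the flag name `--skip-revisions`, the function names, and the `revision_id` column name are all hypothetical, and "disagreement" is taken here to mean that the two parsers return different sequences of section titles. The filter assumes Spark DataFrames and uses a left anti join, which keeps only the rows that have no match in the skip list.

```python
import argparse


def build_parser():
    """CLI sketch for the pipeline; only the hypothetical skip option is shown."""
    parser = argparse.ArgumentParser(description="Data pipeline (sketch)")
    parser.add_argument(
        "--skip-revisions",  # hypothetical flag name, not from the task
        metavar="HDFS_PARQUET",
        default=None,  # optional: when omitted, nothing is filtered out
        help="HDFS parquet of (wiki_db, revision_id) rows to filter out",
    )
    return parser


def _normalize(titles):
    """Collapse runs of whitespace so cosmetic differences are ignored."""
    return [" ".join(title.split()) for title in titles]


def sections_disagree(parser_titles, api_titles):
    """Return True when the two section title sequences differ.

    Assumption: each argument is a list of section titles, one list
    obtained from mwparserfromhell and the other from the Action API.
    """
    return _normalize(parser_titles) != _normalize(api_titles)


def drop_skipped(wikitext_df, skip_df):
    """Remove from wikitext_df every revision listed in skip_df.

    Assumption: both are Spark DataFrames sharing the columns
    wiki_db and revision_id (the latter name is hypothetical).
    """
    keys = ["wiki_db", "revision_id"]
    skip_keys = skip_df.select(*keys).distinct()
    # left_anti keeps the left-hand rows that have no match on the keys
    return wikitext_df.join(skip_keys, on=keys, how="left_anti")
```

In the pipeline, `drop_skipped` would be called once, right after the wikitext dataframe is loaded, and only when the optional argument is given. The maintenance script would write one (wiki_db, revision_id) row for every revision where `sections_disagree` returns True.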

Event Timeline

mfossati changed the task status from Open to In Progress. Nov 22 2022, 10:20 AM
mfossati claimed this task.

Merged, closing.