Research Engineering to create a one-off HTML diff dataset:
- for 6 larger wikis (including enwiki), covering 1 month of revisions: roughly 10-15M HTML diffs
- HTML has to be rendered by Parsoid, so requesting older revisions (that are not cached) from the API is expensive, as opposed to fetching older wikitext, which is just a database lookup. We need to coordinate with SRE to ensure we do this safely.
Context: While we are waiting for a production HTML dataset (T380874), an urgent and strategically important need has appeared that cannot wait for that task. As a result, we now need to create a one-off HTML dataset as quickly as possible.
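
To make the API cost concern concrete, here is a rough Python sketch of how the crawl duration scales with request rate, plus a simple client-side throttle. Everything in it is an assumption for discussion: the 50 requests/second figure is a placeholder (the real limit is whatever gets agreed with SRE), `fetch_html` is a hypothetical callable standing in for the actual API request, and the estimate assumes one uncached render per diff.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def crawl_days(total_requests: int, requests_per_second: float) -> float:
    """Days needed to issue total_requests at a steady request rate."""
    return total_requests / requests_per_second / 86_400


def fetch_throttled(
    rev_ids: Iterable[int],
    fetch_html: Callable[[int], str],  # hypothetical: performs the real API call
    requests_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Tuple[int, str]]:
    """Yield (rev_id, html) pairs without exceeding requests_per_second."""
    min_interval = 1.0 / requests_per_second
    next_allowed = clock()
    for rev_id in rev_ids:
        # Wait until the next request slot opens.
        wait = next_allowed - clock()
        if wait > 0:
            sleep(wait)
        # Schedule the following slot relative to now, so slow responses
        # never cause a burst of catch-up requests.
        next_allowed = max(clock(), next_allowed) + min_interval
        yield rev_id, fetch_html(rev_id)


if __name__ == "__main__":
    ASSUMED_RATE = 50.0  # placeholder, not an agreed limit
    for total in (10_000_000, 15_000_000):
        print(f"{total:,} requests at {ASSUMED_RATE:g}/s: "
              f"{crawl_days(total, ASSUMED_RATE):.1f} days")
```

The throttle is deliberately sequential and conservative. A real run would likely also need retries, backoff on error responses, and an identifying User-Agent, all of which are left out here.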