Redirect info should be dumped at page level
Open, Needs Triage, Public

Description

MediaWiki Dumps 1.0 dumps redirect info at a page level, rather than a revision level.

MediaWiki Dumps 2.0 attempted to dump redirects at the revision level, but this approach has proven to be brittle.

We should modify the Dumps 2 content_history reconciliation enrichment streaming app so that, if the page an event we want to enrich belongs to is a redirect, the redirect data is extended to all of its revisions (see the sketch after the AC below).

AC

  • Dumps 2 sets redirect info at page level
  • Action API requests for redirect (and content) info are performed only on update and create changelog changes.
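
Roughly, the intended behaviour could be sketched like this (illustrative Scala only: the ChangelogEvent fields, the change-kind strings, and the fetchRedirectTarget helper are assumptions, not the actual streaming app code):

case class ChangelogEvent(
  wikiId: String,
  pageId: Long,
  revisionId: Long,
  changeKind: String,            // e.g. "create", "update", "delete"
  redirectTarget: Option[String] // enriched field, same value for every revision of the page
)

// Only create/update changes trigger an Action API lookup; the redirect target is
// keyed by page, so all revisions of a redirect page end up with the same value.
def enrichWithRedirect(
    event: ChangelogEvent,
    fetchRedirectTarget: (String, Long) => Option[String]
): ChangelogEvent =
  event.changeKind match {
    case "create" | "update" =>
      event.copy(redirectTarget = fetchRedirectTarget(event.wikiId, event.pageId))
    case _ =>
      event
  }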

Event Timeline

gmodena renamed this task from "Redirect enriched info should be applied at page level" to "Redirect info should be dumped at page level". Nov 27 2024, 8:53 PM

Pasting here the investigation done for this:

Remains to be researched: what is the behavior of Dumps 1 when filling the page_redirect_title field? Is it just using the page_id, or is it based on the revision_id?

MediaWiki Dumps 1.0 dumps redirect info at a page level, rather than a revision level.

Code from MW Core at: https://github.com/wikimedia/mediawiki/blob/b8df8be4cd12db6e71357a2f7ba3eb1e3a88ff90/includes/export/XmlDumpWriter.php#L269

if ( $row->page_is_redirect ) {
    $services = MediaWikiServices::getInstance();
    $page = $services->getWikiPageFactory()->newFromTitle( $this->currentTitle );
    $redirectStore = $services->getRedirectStore();
    $redirect = $this->invokeLenient(
        static function () use ( $page, $redirectStore ) {
            return $redirectStore->getRedirectTarget( $page );
        },
        'Failed to get redirect target of page ' . $page->getId()
    );
    $redirect = Title::castFromLinkTarget( $redirect );
    if ( $redirect instanceof Title && $redirect->isValidRedirectTarget() ) {
        $out .= '    ';
        $out .= Xml::element( 'redirect', [ 'title' => self::canonicalTitle( $redirect ) ] );
        $out .= "\n";
    }
}

Confirming that this is indeed what we see in the XML dumps:

head -n 1000 simplewiki-20241020-stub-meta-current.xml | grep redirect -B 5 -A 5

...
<page>
    <title>American Units Of Measurement</title>
    <ns>0</ns>
    <id>41</id>
    <redirect title="United States customary units" />
    <revision>
      <id>4691442</id>
      <parentid>36394</parentid>
      <timestamp>2014-01-15T07:11:55Z</timestamp>
      <contributor>
        <username>AvicBot</username>
        <id>114482</id>
      </contributor>
      <minor/>
      <comment>Robot: Fixing double redirect to [[United States customary units]]</comment>
      <origin>4691442</origin>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="43" sha1="ge4o55xxsek0ma2fce8zr5429o9be0z" location="tt:4689925" id="4689925" />
      <sha1>ge4o55xxsek0ma2fce8zr5429o9be0z</sha1>
...

The fact that we see page_redirect_title on wmf.mediawiki_wikitext_history must then mean that we extend that redirect title to all revisions associated with that page when importing the XML. That is, we denormalize the XML content, pushing the redirect title into the revisions.
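
In other words, the import does something along these lines (a minimal sketch; the case classes are illustrative, not the actual refinery import code):

case class PageMeta(pageId: Long, title: String, redirectTitle: Option[String])
case class RevisionRow(revisionId: Long, pageId: Long, pageRedirectTitle: Option[String])

// The page-level <redirect> title is copied onto every revision of the page,
// which is why all revisions of a redirect page share the same page_redirect_title.
def toRevisionRows(page: PageMeta, revisionIds: Seq[Long]): Seq[RevisionRow] =
  revisionIds.map(revId => RevisionRow(revId, page.pageId, page.redirectTitle))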

And indeed that is the behavior we see on this example from simplewiki:

select revision_id, page_id, page_redirect_title from mediawiki_wikitext_history where page_id = 6311 and snapshot = '2024-10' and wiki_db = 'simplewiki' limit 10;
 revision_id | page_id | page_redirect_title 
-------------+---------+---------------------
      859066 |    6311 | Napoleon            
     1745327 |    6311 | Napoleon            
      691235 |    6311 | Napoleon            
     1517581 |    6311 | Napoleon            
     1302929 |    6311 | Napoleon            
     1748110 |    6311 | Napoleon            
     1752342 |    6311 | Napoleon            
      552361 |    6311 | Napoleon            
      667890 |    6311 | Napoleon            
      752288 |    6311 | Napoleon            
(10 rows)

Why do revisions with no redirect on mediawiki_wikitext_history have page_redirect_title set to "" (empty string)?

Because the import job defaults to an empty string instead of NULL: https://github.com/wikimedia/analytics-refinery-source/blob/30a0c941147f5d65574a07a4b9f4fd0b4ab3cec5/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/mediawikidumps/MediawikiObjectsMapsFactory.scala#L112
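
Conceptually the defaulting amounts to something like this (sketch only, not the actual factory code):

// When the XML <page> carries no <redirect> element, the field is
// initialized to "" rather than left as NULL.
def defaultPageRedirectTitle(redirectTitleFromXml: Option[String]): String =
  redirectTitleFromXml.getOrElse("")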

Change #1098985 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/deployment-charts@master] dse-k8s-services: mw-dump: version bump image

https://gerrit.wikimedia.org/r/1098985

Change #1098985 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s-services: mw-dump: version bump image

https://gerrit.wikimedia.org/r/1098985

Next step: @xcollazo to rerun the pipelines from the beginning (~10 days).

I cleared all 14 days of reconcile jobs at https://airflow-analytics.wikimedia.org/dags/dumps_reconcile_wikitext_raw_daily/grid?dag_run[…]ask_id=reconcile.spark_emit_reconcile_events_to_kafka

Each run takes ~90 min to go over all wikis, so the 14 runs should be done in ~21 hours.

Actually, I had to cancel this, clear, and restart it, because I forgot that, once the reconcile event is sent, we set reconcile_emit_dt, and on a rerun we do not attempt to send it again unless we --force it; that flag is meant for manual reruns. Thus the easiest thing to do here is to mutate wikitext_inconsistent_rows_rc1 with:

UPDATE wmf_dumps.wikitext_inconsistent_rows_rc1
SET reconcile_emit_dt = NULL

And then clear and restart the rerun of Airflow jobs.

So I did that as the analytics user:

spark-sql (default)> UPDATE wmf_dumps.wikitext_inconsistent_rows_rc1
                   > SET reconcile_emit_dt = NULL;
Response code
Time taken: 187.312 seconds
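
For context, the skip-already-emitted behaviour described above could be sketched roughly like this (illustrative Spark/Scala; only the table and column names come from the task, the rest is assumed):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Without --force, rows that already have a reconcile_emit_dt are skipped,
// which is why resetting the column to NULL makes the rerun pick them up again.
def rowsToEmit(spark: SparkSession, force: Boolean): DataFrame = {
  val inconsistent = spark.table("wmf_dumps.wikitext_inconsistent_rows_rc1")
  if (force) inconsistent
  else inconsistent.where("reconcile_emit_dt IS NULL")
}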

Just to note that this is all done now, so we can verify whether the Flink app generated the correct events.

I validated that a test sample (simplewiki) of reconciled + enriched records from the updated xcollazo.wikitext_raw_rc2_wiki_partitioned_plus_bf_on_rev_id has redirect info matching Dumps 1 (via mediawiki_wikitext_current).
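
A spot check of this kind could look roughly like the following (illustrative Spark/Scala; the join keys, partition filter, and column names beyond those mentioned above are assumptions, not the exact query that was run):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Compare redirect titles between the Dumps 2 test output and Dumps 1 for simplewiki;
// zero resulting rows would mean the redirect info matches.
val mismatches = spark.sql("""
  SELECT d2.revision_id,
         d2.page_redirect_title AS dumps2_redirect,
         d1.page_redirect_title AS dumps1_redirect
  FROM xcollazo.wikitext_raw_rc2_wiki_partitioned_plus_bf_on_rev_id d2
  JOIN wmf.mediawiki_wikitext_current d1
    ON d1.revision_id = d2.revision_id AND d1.wiki_db = d2.wiki_db
  WHERE d2.wiki_db = 'simplewiki'
    AND d1.snapshot = '2024-10'
    AND coalesce(d2.page_redirect_title, '') <> coalesce(d1.page_redirect_title, '')
""")
mismatches.show()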