Page MenuHomePhabricator

Missing/inconsistent page_redirect_target field for redirects in Mediawiki content current v1 dumps
Open, Needs TriagePublic

Description

Trying to find all non-redirect pages in the Mediawiki content current v1 dumps, I came across two potential bugs:

  • entry in page_redirect_target for non-redirect pages is inconsistent, either "" or null. For enwiki and main namespace it is ~1.6M and ~6.4M articles, respectively.
  • a large number of redirect pages do not have the correct entry in page_redirect_target. For enwiki and main namespace, ~900K redirects have page_redirect_target==null.

Checking whether the content contains the string "#REDIRECT" is not a sufficient solution since wikis can define different aliases for redirects; e.g., dewiki also uses "Weiterleitung" (see T232123#5499245). As a result, we would likely miss many redirects with this approach. Thus, having the correct entry for page_redirect_target is crucial. In the older Mediawiki wikitext current dump, redirects could be easily identified without looking into the content via page_redirect_title==""

Example query:

spark.read.table("wmf_content.mediawiki_content_current_v1")
.where(F.col("wiki_id")=="enwiki")
.where(F.col("page_namespace_id")==0)
.where(F.col("page_redirect_target").isNull())
.where(F.col("revision_content_slots.main.content_body").contains("#REDIRECT"))
.select("page_id","page_title","page_redirect_target","revision_content_slots.main.content_body")

Examples:

+--------+--------------------+--------------------+--------------------+
| page_id|          page_title|page_redirect_target|        content_body|
+--------+--------------------+--------------------+--------------------+
|58416114|Kabaddi_at_the_20...|                null|#REDIRECT [[:Kaba...|
|18603651|2008_Twenty20_Cup...|                null|#REDIRECT [[2008 ...|
|18603788|2008_Twenty20_Cup...|                null|#REDIRECT [[2008 ...|
|58428335|           Corneille|                null|#REDIRECT [[Pierr...|
|58472190|Deliathis_impluvi...|                null|#REDIRECT [[Delia...|
|  177868|     Blizzard_Of_Ozz|                null|#REDIRECT [[Blizz...|
|  179790|           Pay_phone|                null|#REDIRECT [[Payph...|
|  190700|           9_January|                null|#REDIRECT [[Janua...|
|16573528|     Deportivo_Moron|                null|#REDIRECT [[Depor...|
|16595958| Moron,_Buenos_Aires|                null|#REDIRECT [[Morón...|
| 3424386|     Battle_of_Chile|                null|#REDIRECT [[Chile...|
| 3451905|List_of_assets_ow...|                null|#REDIRECT [[List ...|
|16782653|Dakotadon_lakotae...|                null|#REDIRECT [[Dakot...|
|68067749|    Krishnanda_Singh|                null|#REDIRECT [[Krish...|
|68170835|          D.S._Cohen|                null|#REDIRECT [[List ...|
| 6743586| Western_Los_Angeles|                null|#REDIRECT [[Wests...|
|63377586|        3_Amigos_250|                null|#REDIRECT [[ARCA ...|
|28244023|Bremmer_(disambig...|                null|#REDIRECT [[Bremm...|
|28244053|Bremer_(disambigu...|                null|#REDIRECT [[Breme...|
| 6892337|Trinomys_mirapitanga|                null|#REDIRECT [[Dark-...|
+--------+--------------------+--------------------+--------------------+
only showing top 20 rows

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
add redirect checksrepos/data-engineering/mediawiki-content-pipelines!104a-pizzataT417596-redirect-pagesmain
Fix page_redirect_target to ingest from the correct stream element.repos/data-engineering/mediawiki-content-pipelines!82xcollazofix-redirectmain
Customize query in GitLab

Event Timeline

It looks like this bug has been there since we put the new tables in production, as older, backfilled data does not seem to be affected:

spark.sql("""
SELECT
  count(1) as count,
  page_redirect_target,
  DATE_TRUNC("month", revision_dt) as d
FROM wmf_content.mediawiki_content_current_v1
WHERE wiki_id = 'enwiki'
  AND page_namespace_id = 0
  AND (page_redirect_target IS NULL OR page_redirect_target = "")
  AND startswith(revision_content_slots['main'].content_body, '#REDIRECT')
GROUP BY d, page_redirect_target
ORDER BY d DESC
""").show(truncate=True)
[Stage 34:=====================================================>(517 + 1) / 518]
+------+--------------------+-------------------+
| count|page_redirect_target|                  d|
+------+--------------------+-------------------+
| 31235|                null|2025-10-01 00:00:00|
|153886|                null|2025-09-01 00:00:00|
| 88680|                null|2025-08-01 00:00:00|
| 70961|                null|2025-07-01 00:00:00|
| 67738|                null|2025-06-01 00:00:00|
| 68260|                null|2025-05-01 00:00:00|
| 79610|                null|2025-04-01 00:00:00|
| 71966|                null|2025-03-01 00:00:00|
| 61984|                null|2025-02-01 00:00:00|
| 57733|                null|2025-01-01 00:00:00|
| 69996|                null|2024-12-01 00:00:00|
| 60081|                null|2024-11-01 00:00:00|
| 73186|                null|2024-10-01 00:00:00|
| 60557|                null|2024-09-01 00:00:00|
| 62763|                null|2024-08-01 00:00:00|
| 49233|                null|2024-07-01 00:00:00|
|     1|                    |2021-02-01 00:00:00|
|     1|                    |2021-01-01 00:00:00|
|     1|                    |2008-05-01 00:00:00|
+------+--------------------+-------------------+

Change #1197648 had a related patch set uploaded (by Aqu; author: AQ):

[mediawiki/extensions/EventBus@master] Draft: Fix missing created_redirect_page

https://gerrit.wikimedia.org/r/1197648

Hm @Antoine_Quhen ! reading Draft: Fix missing created_redirect_page (1197648). IIUC, this means that we have never properly set the created_redirect_page.redirect_page_link field in events, yes?

So currently, on page moves we set:

# info about the page_id that was moved.
page 

#if the page that was moved itself is a redirect, info about the page it points to (which in most cases is probably not true on page moves, unless the page_id being moved is already a redirect)
page.redirect_page_link 

# info about the page that was created and left behind at the old page title as a redirect
created_redirect_page

But we do not properly set

# into about the page that the newly created redirect page points to (which should === page)
created_redirect_page.redirect_page_link

Correct?

If we want to add created_redirect_page.redirect_page_link, we'll need to add it to the schema. Here is how redirect_page_link is added to page object. We'd have to do the same for created_redirect_page here.

If we do want to add this, we may want to consider just adding redirect_page_link to the main page entity schema, so that all representations of a page have the ability to include and model the page they are redirecting to.

However...are we sure we want to do this? Where does the chain of pages inside of pages stop? page_link entity is a page model with an extra field. In the cases of redirect chains, Do we intend to also set e.g. created_redirect_page.redirect_page_link.redirect_page_link.redirect_page_link? I think not!

In the case of page moves, created_redirect_page.redirect_page_link.page_id should === page.page_id. I don't believe (?) that pages can be 'moved' to other wikis, so the interwiki_prefix field in page_link is not useful here. So the page object should have exactly the same info that you are trying to add in created_redirect_page.redirect_page_link.

Perhaps mw_content_history pipeline just needs to account for that fact and use page as the created_redirect_page redirect target?

Perhaps, what we need in EventBus (or maybe even upstream in the PageMoveEvent) is an assertion that $createdRedirectPageTarget == $event->getPageRecordAfter()?

I was looking for past discussion when we created the page_link schema. We did consider adding to entity page schema: T325315#8898609, but I don't see why we chose not to at the time. More discussion is also T331399: Create new mediawiki links change streams based on fragment/mediawiki/state/change/page.

I don't believe (?) that pages can be 'moved' to other wikis, so the interwiki_prefix field in page_link is not useful here

I am still not sure about page moves, but from reading those ^ tickets Daniel did say that it is possible for a redirect to link to another wiki's page. I'd guess that that would be a manually added redirect link though, not one that MW would automatically create for its created_redirect_page.

@Antoine_Quhen, @Ottomata and I met to discuss this ticket.

Although the issue discussed at T400632#11293892 could potentially affect future users of the page_change stream, it does not seem to be the culprit of this ticket.

Instead, we figured that the ingestion code at process_events.py#L206 is the issue. We misinterpreted the purpose of the page_change event created_redirect_page, and we should instead be ingesting page.redirect_page_link.page_title.

Change #1197648 abandoned by Aqu:

[mediawiki/extensions/EventBus@master] Draft: Fix missing created_redirect_page

Reason:

T400632

https://gerrit.wikimedia.org/r/1197648

To recap, this fix we merged should fix this issue from Oct 22 and on:

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/82

Fix page_redirect_target to ingest from the correct stream element.

However, we still need to fix data from before Oct 22.

We can fix older data by either:

  • Running a query like that in T400632#11273972 to identify all offending rows and manually add them to the inconsistencies table.
  • Or, we could modify the reconcile algorithm to detect these cases.

To recap, this fix we merged should fix this issue from Oct 22 and on:

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/82

Fix page_redirect_target to ingest from the correct stream element.

However, we still need to fix data from before Oct 22.

We can fix older data by either:

  • Running a query like that in T400632#11273972 to identify all offending rows and manually add them to the inconsistencies table.
  • Or, we could modify the reconcile algorithm to detect these cases.

On second though, as per the description of this issue, running a query to detect the offending rows would be brittle considering we would have to know all the possible REDIRECT translations.

Let's consider wether modifying the reconcile algorithm could work.

Additionally, if consistency is desired, we can UPDATE all old data to have NULL in page_redirect_target instead of the empty string.

Over at T410405 we did an UPDATE to all our content tables so that page_redirect_target is consistently NULL instead of a mix of NULLs and empty string when it is not defined.

Now we still need to do something to reconcile pages that are meant to have page_redirect_target set but do not.

One idea that @APizzata-WMF and I had discussed was to consider the page table from the MariaDB replicas when doing a reconcile. We though that, for daily reconcile, we could just pull the neccesary rows by leveraging the page_touched timestamp-like column. Unfortunately, what we missed in that conversation is that this column is not indexed, thus it would still incur a full table scan to be able to filter on it:

mysql:research@dbstore1008.eqiad.wmnet [enwiki]> SELECT index_name, column_name, seq_in_index FROM INFORM
ATION_SCHEMA.STATISTICS WHERE TABLE_NAME = 'page';
+-----------------------------+------------------+--------------+
| index_name                  | column_name      | seq_in_index |
+-----------------------------+------------------+--------------+
| PRIMARY                     | page_id          |            1 |
| page_name_title             | page_namespace   |            1 |
| page_name_title             | page_title       |            2 |
| page_random                 | page_random      |            1 |
| page_len                    | page_len         |            1 |
| page_redirect_namespace_len | page_is_redirect |            1 |
| page_redirect_namespace_len | page_namespace   |            2 |
| page_redirect_namespace_len | page_len         |            3 |
+-----------------------------+------------------+--------------+
8 rows in set (0.001 sec)

Discussing with @JAllemandou, he suggests that perhaps we need to diverge the daily and monthly reconciles, and to consider heavier reconcile issues like this only on the monthly. On monthly run, we can also wait for the sqooped version of the page table, and leverage that.

Will discuss this further with @APizzata-WMF .

Will discuss this further with @APizzata-WMF .

After discussing with @xcollazo we are pushing out a fix that uses the redirect table of MariaDB, here is the MR.