Page MenuHomePhabricator

Many revision events in mediawiki_history have missing page and namespace information
Closed, ResolvedPublic13 Estimated Story Points

Description

While investigating T218819, I found that many revision events in mediawiki_history are lacking page and namespace information in revision event. The effect is major: when calculating global monthly active editors (where only content edits are counted), this bug reduced the number by as much as 10 000 per month.

There seem to be at least three different problems at work:

Deleted pages where all the revisions have null page_namespace_historical, page_namespace_is_content and page_namespace_is_content_historical:

  • nlwiki page 5160672
  • commonswiki page 71797356
  • wikidatawiki page 9637937
  • arwiki page 4970274
  • commonswiki page 73916373
  • ruwiki page 7054391

Revisions with null page_namespace_historical and page_namespace_is_content_historical. Most but not all of the revisions to the pages concerned are affected.

Revisions where the join to the page table seems to have failed entirely, because they have have null page_title, page_namespace, and page_namespace_is_content (including historical version) and page_creation_timestamp. Most but not all of the revisions to the pages concerned are affected.

Event Timeline

fdans moved this task from Incoming to Data Quality on the Analytics board.

@Nuria, @JAllemandou, this is the most significant issue that I identified in T218819 (T218463 is also major, but you've already fixed it 👏).

It would be really useful if this could be fixed in the April snapshot, since this is impeding our ability to accurately calculate our board metrics from the Data Lake (cc @kzimmerman)

It would be really useful if this could be fixed in the April snapshot,

@JAllemandou is going to organize under a parent task issues fixed on April snapshot, I do not think this will be one of them but other fixes might have mitigated this problem , let's look at it again once April snapshot is out.

when calculating global monthly active editors (where only content edits are counted), this bug reduced the number by as much as 10 000 per month.

@Neil_P._Quinn_WMF Can you quantify this a bit more?Is this 10,000 over about 100,000 editors (10% difference) seems quite a lot, we have been doing some work on vetting metrics per family and while we have done edits and not active editors numbers do not differ that much (this might be non relevant as I certainly can see how edits could be similar but not active editors if issues affect distribution of edits per editor). See: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2/Vetting_data_lake_metrics_for_project_families

I am puzzled on how edits are within couple percentages but active editors are so different, mmm.. this does not add up. Can @Neil_P._Quinn_WMF write the selects he used on this ticket?

Please see vetting of active editors per wiki here: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2/Data_Quality

Thanks for offering @Neil_P._Quinn_WMF :)
I'm still working on changing the algorithm, so no need from you as of now.
I'll let you know once I have a test dataset.

This is solved from snapshot 2019-05 onward thanks to the rebuild of the page-history reconstruction algorithm:

spark.sql(s"""
SELECT
  revision_is_deleted_by_page_deletion as rev_deleted,
  page_is_deleted as page_deleted,
  page_title IS NULL as null_title,
  page_namespace IS NULL as null_namespace,
  page_namespace_is_content IS NULL as null_is_content_namespace,
  page_title_historical IS NULL as null_historical_title,
  page_namespace_historical IS NULL as null_historical_namespace,
  page_namespace_is_content_historical IS NULL as null_historical_is_content_namespace,
  COUNT(1) as c
FROM wmf.mediawiki_history
WHERE snapshot = '2019-05'
  AND event_entity = 'revision'
GROUP BY
  revision_is_deleted_by_page_deletion,
  page_is_deleted,
  page_title IS NULL,
  page_namespace IS NULL,
  page_namespace_is_content IS NULL,
  page_title_historical IS NULL,
  page_namespace_historical IS NULL,
  page_namespace_is_content_historical IS NULL
ORDER BY
  rev_deleted,
  page_deleted,
  null_title,
  null_namespace,
  null_is_content_namespace,
  null_historical_title,
  null_historical_namespace,
  null_historical_is_content_namespace
""").show(1000, false)


+-----------+------------+----------+--------------+-------------------------+---------------------+-------------------------+------------------------------------+----------+
|rev_deleted|page_deleted|null_title|null_namespace|null_is_content_namespace|null_historical_title|null_historical_namespace|null_historical_is_content_namespace|c         |
+-----------+------------+----------+--------------+-------------------------+---------------------+-------------------------+------------------------------------+----------+
|false      |null        |true      |true          |true                     |true                 |true                     |true                                |1457740   |
|false      |false       |false     |false         |false                    |false                |false                    |false                               |4017220948|
|false      |true        |false     |false         |false                    |false                |false                    |false                               |98330     |
|true       |null        |true      |true          |true                     |false                |false                    |false                               |48530106  |
|true       |null        |true      |true          |true                     |false                |false                    |true                                |414283    |
|true       |false       |false     |false         |false                    |false                |false                    |false                               |195863    |
|true       |true        |false     |false         |false                    |false                |false                    |false                               |157697806 |
+-----------+------------+----------+--------------+-------------------------+---------------------+-------------------------+------------------------------------+----------+

In words:

  • 95.07% of all revisions are not deleted, don't belong to a deleted page, and have all their page-information set for current and historical.
  • 3.73% of all revisions are deleted, belonging to a deleted page (makes sense!), and also have all their current and historical page-info set.
  • 1.15% of all revisions are deleted but are not linked to a deleted page (I think they mostly belong to a deleted page we have not yet managed to rebuild). Those revisions don't have current-page information, but they ave their historical one.
  • 0.03% of all revisions are not deleted and are completely unlinked to any page, having no page information (the few ones I checked were orphan - meaning there was no page with their page_id in the page table).
  • 0.01% of all revisions are deleted, not linked to their page, and have undefined historical_is_content_namespace - Due to namespaces having changed over time.
  • 0.005% of all revisions are deleted, but belong to a non-deleted page (possible through delete and partial restore after page-id recycling). Their page info is set, both current and historical.
  • 0.002% of all revisions are not deleted but are linked to deleted page (super weird case - probably corrupted revisions). Their page info is set, both current and historical.

@Neil_P._Quinn_WMF - I let you play with the data and close the taks if it's ok :)

Nuria set the point value for this task to 13.