Filter content metrics for canonical wikis
Closed, Resolved (Public)

Description

The knowledge gaps pipeline depends on a number of data sources to provide metrics per (wiki_db, gap, category, time). Some of these sources don't filter for canonical wikis, which results in non-canonical wikis appearing in the output datasets.

Specifically, the culprit appears to be the page revision history logic, since article_created and revision_count are the only non-null metrics for these wikis. The filter step needs to be added here.

    # shows values for article_created and revision_count
    spark.table("knowledge_gaps.by_category").where("wiki_db='srwikiquote'").select("metrics.*").distinct().show()
+---------------+-------------+--------------+----------------+----------------------+-------------+--------------+
|article_created|pageviews_sum|pageviews_mean|standard_quality|standard_quality_count|quality_score|revision_count|
+---------------+-------------+--------------+----------------+----------------------+-------------+--------------+
|              2|         null|          null|            null|                  null|         null|             6|
|              8|         null|          null|            null|                  null|         null|            13|

To filter for Wikipedia projects, see Isaac's comment here.
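
As a rough illustration of the filter step, here is a minimal plain-Python sketch of the logic, not the pipeline's actual code. The reference list `CANONICAL_WIKIS`, its `project` field, the sample rows, and the function name `filter_canonical` are all hypothetical; the real source of canonical wikis is whatever Isaac's comment points to.

```python
# Hypothetical reference list of canonical wikis (assumption for illustration;
# the real pipeline would read this from its canonical wiki data source).
CANONICAL_WIKIS = [
    {"wiki_db": "enwiki", "project": "wikipedia"},
    {"wiki_db": "srwiki", "project": "wikipedia"},
    {"wiki_db": "srwikiquote", "project": "wikiquote"},
]


def filter_canonical(rows, canonical, project="wikipedia"):
    """Keep only rows whose wiki_db belongs to the given project group."""
    # Build the set of allowed wiki_db values once, then filter the rows.
    allowed = {w["wiki_db"] for w in canonical if w["project"] == project}
    return [r for r in rows if r["wiki_db"] in allowed]


# Made-up sample rows standing in for the page revision history metrics.
rows = [
    {"wiki_db": "enwiki", "revision_count": 120},
    {"wiki_db": "srwikiquote", "revision_count": 6},
]

filtered = filter_canonical(rows, CANONICAL_WIKIS)
print([r["wiki_db"] for r in filtered])
```

In Spark the same idea would typically be expressed as a left-semi join of the revision history DataFrame against the canonical wikis DataFrame on wiki_db, which drops non-matching wikis without adding columns.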

Details

Due Date
Sep 12 2023, 4:00 AM