Page MenuHomePhabricator

[Analytics] Monitoring Pageviews of Specific Wikidata Integration Help Pages
Closed, ResolvedPublic

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create tasks for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

As part of the Wikidata Integrations Team's annual goals, we would like to track the awareness of Wikidata Integration by monitoring the pages of particular help pages hosted on Wikidata. These types of help pages provide useful information for editors who want to use WIkidata's data in their projects. An example of such a page includes the recently revamped Help Sitelinks page. We want to check if improving the information on the pages attracts more users.

Specific Results

Please detail the specific results that the task should deliver.

A pipeline for this directly such that the data is saved on a regular basis e.g. on a monthly or quarterly basis. The pages we are concerned about are:

Desired Outputs

Please list the desired outputs of this task.

February 2025

  • total_help_items_views: 47,017
  • total_help_sitelinks_views: 6,255
  • total_help_edit_summary_view: 76
  • total_wikidata_wiwp_views: 469
  • total_wikidata_how_to_use_wd_views: 3,679
  • Total from summing the above: 57,496

March 2025

  • total_help_items_views: 12,557
  • total_help_sitelinks_views: 4,076
  • total_help_edit_summary_view: 70
  • total_wikidata_wiwp_views: 530
  • total_wikidata_how_to_use_wd_views: 1,974
  • Total from summing the above: 19,207
NOTE: The process of adding in a DAG will be added to this task - see below:
  • A DAG to collect these metrics on a monthly basis
    • Populates the table wmde.wit_docs_pageview_metrics_monthly
  • The data from the original task being added to the new table

Deadline

Please make the time sensitivity of this request clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

09.04.2025


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

February and March Numbers

  • Explore potentially useful tables (DataHub)
  • Derive method of getting views for specific pages
  • Get totals per page and aggregate for February and March

Monthly DAG

  • Write Iceberg table create table script
  • Create a Iceberg table in Hive/HDFS within the wmde namespace
  • Convert monthly stats query to production Airflow query
  • Generate testing table generation and query scripts
  • Write DAG to run job query
  • Write DAG tests
  • Run tests on process as possible
  • Deploy DAG

Adding Old Data

  • Add data from notebook and wmf_deprecated.webrequest to the new Iceberg table

Estimation

Estimate - Feb/Mar numbers: 1/2 a day
Actual - Feb/Mar numbers: 1/2 a day
Estimate - DAG: 1/2 a day
Actual - DAG: 1/2 a day

Data

The tables that will be referenced in this task.

  • wmf.webrequest
  • wmf_deprecated.webrequest

Notes

Things that came up during the completion of this task, questions to be answered and follow-up tasks.

NOTE: We needed to use wmf_deprecated.webrequest for this task as the data for February and March isn't in wmf.webrequest because of the HAProxy migration.

Event Timeline

The query for this can be found below. The value for month = 2 in WITH wdi_doc_page_view_flags AS (...) needs to be changed to get the values from other months.

WITH wdi_doc_page_view_flags AS (
    SELECT
        CASE
            WHEN pageview_info.page_title LIKE 'Help:Items%'
            THEN 1
            ELSE 0
        END AS help_items,

        CASE
            WHEN pageview_info.page_title LIKE 'Help:Sitelinks%'
            THEN 1
            ELSE 0
        END AS help_sitelinks,

        CASE
            WHEN pageview_info.page_title LIKE 'Help:Edit_summary%'
            THEN 1
            ELSE 0
        END AS help_edit_summary,

        CASE
            WHEN pageview_info.page_title LIKE 'Wikidata:Wikidata_in_Wikimedia_projects%'
            THEN 1
            ELSE 0 
        END AS wikidata_wiwp,

        CASE
            WHEN pageview_info.page_title LIKE 'Wikidata:How_to_use_data_on_Wikimedia_projects%'
            THEN 1
            ELSE 0
        END AS wikidata_how_to_use_wd

    FROM
        wmf_deprecated.webrequest
    
    WHERE
        year = 2025
        AND month = 2
        AND webrequest_source = 'text'
        AND uri_host RLIKE '^(?!test).*wikidata.*'
        AND is_pageview = TRUE
        AND (
            pageview_info.page_title LIKE 'Help:Items%'
            OR pageview_info.page_title LIKE 'Help:Sitelinks%'
            OR pageview_info.page_title LIKE 'Help:Edit_summary%'
            OR pageview_info.page_title LIKE 'Wikidata:Wikidata_in_Wikimedia_projects%'
            OR pageview_info.page_title LIKE 'Wikidata:How_to_use_data_on_Wikimedia_projects%'
        )
)

SELECT
    sum(help_items) AS total_help_items_views,
    sum(help_sitelinks) AS total_help_sitelinks_views,
    sum(help_edit_summary) AS total_help_edit_summary_views,
    sum(wikidata_wiwp) AS total_wikidata_wiwp_views,
    sum(wikidata_how_to_use_wd) AS total_wikidata_how_to_use_wd_views
    
FROM
    wdi_doc_page_view_flags
;

Leaving this task as In Progress as I'd suggest we do the DAG for this early next week so the process is finished up :) All work for the February and March values is done though. I really am not sure why we had such a dip in March. Yes we're using wmf_deprecated.webrequest as the March numbers in wmf.webrequest has restricted historical data now, but I did check wmf_deprecated.webrequest and the maximum day in March is 31, so we do have the full month :)

Queries have been converted over for a monthly DAG, with the output being:

NOTE: The numbers below are partial for April and are just for testing.
monthtotal_help_items_viewstotal_help_sitelinks_viewstotal_help_edit_summary_viewstotal_wikidata_wikidata_in_wmp_viewstotal_wikidata_how_to_use_data_on_wmp_views
2025-045481201167274768

The columns are basically the page ID converted to snake_case:

  • total_help_items_views: Help:Items
  • total_help_sitelinks_views: Help:Sitelinks
  • total_help_edit_summary_views: Help:Edit_Summary
  • total_wikidata_wikidata_in_wmp_views: Wikidata:Wikidata_in_Wikimedia_Projects
  • total_wikidata_how_to_use_data_on_wmp_views: Wikidata:How_to_use_Wikidata_in_Wikimedia_Projects

Merge request airflow-dags#1236 is already open for the DAG and tests are passing, so we just need a quick review and we'll be done here 😊

DAG has been deployed and the originally reported data has been inserted into the new table. We'll get the first run on the 1st of May 😊

Moving into In Review :)

karapayneWMDE added a subscriber: AndrewTavis_WMDE.

Task is complete. please review and close or ping us on what was missed

Thanks Kara for the ping. Hey @AndrewTavis_WMDE, can I view it in Superset? I'll write to you separately about this. Thank you :)

Yes you can view this data in Superset, @Ifeatu_Nnaobi_WMDE :) The table is wmde.wit_docs_pageview_metrics_monthly 😊