
[Analytics] Airflow implementation of unique ips accessing Wikidata's REST API metrics
Closed, Resolved, Public

Description

This task builds on the results of T334558: [Analytics] Unique user-agents accessing Wikidata's REST API for Q2/2023, but implements the metric as a running 30-day average.

Problem:
The Wikidata/Wikidata Analytics team would like to know how many unique users are accessing the new REST API. One issue is that the information from the webrequests table is only retained for 90 days. Setting up a process within the WMF Airflow DAGs would allow us to save aggregated copies of the data for future reporting. T334558 focused on the user_agent metadata associated with API requests. Based on the results of that task, Wikidata Analytics decided to switch to IP tracking for our internal metrics, as there were many cases where user_agent data was being manipulated to allow for easier access (see T329044: Require clients to follow our User-Agent policy).
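
As a rough illustration of the idea, here is a minimal Python sketch of a running 30-day distinct-IP count. It is not the production query: the `(day, ip)` record shape, the function name `rolling_unique_ips` and the sample addresses are all assumptions made for this example, and it reads "running 30 day" as a rolling window of distinct counts, which is only one possible interpretation.

```python
from collections import defaultdict
from datetime import date, timedelta


def rolling_unique_ips(requests, window_days=30):
    """Count distinct IPs seen in the `window_days` days ending on each day.

    `requests` is an iterable of (day, ip) pairs. This record shape is an
    assumption for illustration, not the schema of the real table.
    """
    # Group the distinct IPs by the day they were seen.
    ips_by_day = defaultdict(set)
    for day, ip in requests:
        ips_by_day[day].add(ip)

    counts = {}
    for end_day in sorted(ips_by_day):
        # Union the IP sets of every day inside the window.
        window_ips = set()
        for offset in range(window_days):
            window_ips |= ips_by_day.get(end_day - timedelta(days=offset), set())
        counts[end_day] = len(window_ips)
    return counts


# Hypothetical sample data (documentation-range IP addresses).
sample = [
    (date(2024, 2, 1), "198.51.100.7"),
    (date(2024, 2, 1), "198.51.100.7"),
    (date(2024, 2, 2), "203.0.113.5"),
    (date(2024, 3, 5), "198.51.100.7"),
]
result = rolling_unique_ips(sample)
```

Only the per-window counts are kept, so an aggregate like this can be stored beyond the 90-day retention of the raw request data.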

How the data will be used:

  • This data will be used for WMDE quarterly reporting related to the REST API.
  • It will help identify more refined and meaningful metrics that PMs can monitor continuously to understand Wikidata.

Assignee Planning

Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.

Sub Tasks

Full breakdown of the steps to complete this task:

  • Getting the general structure of the wmde directory set up on GitLab
  • Testing the aggregation queries within the analytics flow
  • Deployment of REST API request aggregation queries

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

Notes and Questions

Things that came up during the completion of this task, questions to be answered and follow up tasks:

  • Note

Event Timeline

Manuel renamed this task from [Analytics] Unique user-agents accessing Wikidata's REST API for Q2/2023 to [Analytics] Airflow implementation of unique user-agents accessing Wikidata's REST API. (Jul 7 2023, 8:48 AM)
Manuel renamed this task from [Analytics] Airflow implementation of unique user-agents accessing Wikidata's REST API to [Analytics] Airflow implementation of unique user-agents accessing Wikidata's REST API metrics.
Manuel updated the task description.
Manuel edited projects, added Wikidata Analytics; removed Wikidata Analytics (Kanban).
Manuel added a subscriber: AndrewTavis_WMDE.
AndrewTavis_WMDE renamed this task from [Analytics] Airflow implementation of unique user-agents accessing Wikidata's REST API metrics to [Analytics] Airflow implementation of unique ips accessing Wikidata's REST API metrics. (Jul 18 2023, 1:06 PM)
AndrewTavis_WMDE claimed this task.
AndrewTavis_WMDE updated the task description.
Manuel changed the task status from Open to Stalled. (Oct 17 2023, 8:50 AM)
Manuel moved this task from Needs PM work to Monitoring on the Wikidata Analytics board.

Note that this task will also include the user_agent values, as we'll be computing the typical reporting metrics in one query. Until now we had three different functions being run, but this can be simplified to one HiveQL process that then updates an Iceberg table. The query itself is already written, and I'm working on setting up the directories for the Airflow DAGs as well as the Wikimedia Deutschland repo on GitLab.
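
To illustrate what "one process instead of three functions" could look like, here is a minimal Python sketch of a single-pass aggregation. It is not the HiveQL query from the merge request. The `(month, user_agent, ip)` record shape, the function names and especially the `is_filtered_out` rule are hypothetical; the real filtering criteria are not described in this task. The output keys follow the column names of the resulting metrics table.

```python
from collections import defaultdict


def is_filtered_out(user_agent):
    """Hypothetical filter: drop empty or obviously generic user agents."""
    return not user_agent or user_agent.strip() in {"-", "Mozilla/5.0"}


def monthly_metrics(requests):
    """Aggregate (month, user_agent, ip) records into one row per month.

    All three distinct counts are collected in a single pass over the data.
    """
    seen = defaultdict(
        lambda: {"user_agents": set(), "filtered": set(), "ips": set()}
    )
    for month, user_agent, ip in requests:
        bucket = seen[month]
        bucket["user_agents"].add(user_agent)
        if not is_filtered_out(user_agent):
            bucket["filtered"].add(user_agent)
        bucket["ips"].add(ip)

    return [
        {
            "month": month,
            "total_user_agents": len(bucket["user_agents"]),
            "total_filtered_user_agents": len(bucket["filtered"]),
            "total_ips": len(bucket["ips"]),
        }
        for month, bucket in sorted(seen.items())
    ]


# Hypothetical sample data, not real request logs.
rows = monthly_metrics([
    ("2024-02", "ExampleBot/1.0", "198.51.100.7"),
    ("2024-02", "ExampleBot/1.0", "203.0.113.5"),
    ("2024-02", "-", "203.0.113.5"),
])
```

Collecting the three sets together means the source data is scanned once, which is the main appeal of merging the separate functions into one query.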

Manuel changed the task status from Stalled to Open. (Mar 14 2024, 10:35 AM)

Merge request for this has been sent and can be found here :) Requested WMF's review on this first one, but we'll need to take over from there unless there are problems with it all. Deployment can happen after it's brought in.

Merge request has been brought in, and we've successfully deployed! 🎉 An output from the new wmde.wd_rest_api_metrics_monthly table is:

month  total_user_agents  total_filtered_user_agents  total_ips
2024-02-0145842414539

We have lift-off! 🎉