
[Analytics][QuartSup M4] Quartely/monthly User Agents using Wikidata's new REST API
Closed, Resolved (Public)

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

As Wikidata Product Managers, we would like to better understand how Wikidata's new REST API is used. The analysis in T366621: [Analytics] Analysis of REST API user agents for May 2024 was very useful for this, and we would like to receive this information regularly.

Note to Lydia: To further improve the quality of the data, we would need to take steps to strengthen our users' compliance with the user agent policy (T329044).

Specific Results

Please detail the specific results that the task should deliver.

Wikidata PMs in the nda LDAP group (with PII access) receive the information from T366621: [Analytics] Analysis of REST API user agents for May 2024 once per quarter (if a manual step is involved) or once per month (if this is fully automated).

Optional: add a split by read/write operations (but don't bother if this is complicated)

Desired Outputs

Please list the desired outputs of this task.

Easy-to-access spreadsheet/CSV (anything relying on our normal UCS, SUL or Wikitech logins will do, but it should not require Kerberos authentication)

Deadline

Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

DD.MM.YYYY


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Write a DAG to call a query for Wikidata REST API user agents
  • Write a query for user agents and total requests for a given month and the month before
  • Create the needed table and make sure that the analytics-wmde user has access to it
  • Test the query in a Spark SQL instance to make sure it runs properly
  • Test the DAG to make sure it finishes properly
  • Send out to Lydia and Ifrah for confirmation (ensure that Ifrah gets the right permissions first, if necessary)

Estimation

Estimate: Half a day
Actual: One day (mostly waiting for it to finish while working on other things)

Data

The tables that will be referenced in this task.

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note

Event Timeline

Manuel moved this task from Incoming to To-Do on the Wikidata Analytics (Kanban) board.
Manuel added a subscriber: Andrew-WMDE.
Manuel removed a subscriber: Andrew-WMDE.
Manuel updated the task description. (Show Details)
Manuel renamed this task from [Analytics] List of User Agents using Wikidata's new REST API to [Analytics] Quartely/monthly User Agents using Wikidata's new REST API.Jun 20 2024, 9:18 AM
Manuel removed Manuel as the assignee of this task.
Manuel updated the task description. (Show Details)

A question on this as I'm writing the basics while I'm waiting on info for the testing of the other DAGs:

  • Do we want only the data for the previous month, or should we also include the data from the month before that?
    • Checking whether it was helpful to have only the May user agents, or the April ones as well
    • I think the current month plus the previous month is the furthest we can easily backdate: with more, we could exceed 90 days (March, April and May together are 92 days)
    • Happy to include this as it's really not much more effort!
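The day count in the bullet above can be checked with a small sketch. The function name `days_in_window` and the constant `RETENTION_DAYS` are illustrative assumptions, not part of the DAG:

```python
import calendar

RETENTION_DAYS = 90  # the 90-day limit mentioned in this thread


def days_in_window(year: int, last_month: int, n_months: int) -> int:
    """Total days in the n_months calendar months ending with last_month."""
    total = 0
    y, m = year, last_month
    for _ in range(n_months):
        total += calendar.monthrange(y, m)[1]  # number of days in month m
        m -= 1
        if m == 0:  # step back across a year boundary
            y, m = y - 1, 12
    return total


two_months = days_in_window(2024, 5, 2)    # April + May 2024
three_months = days_in_window(2024, 5, 3)  # March + April + May 2024
print(two_months, two_months <= RETENTION_DAYS)      # 61 True
print(three_months, three_months <= RETENTION_DAYS)  # 92 False
```

Two months always fit within 90 days (at most 62), while three consecutive months only fit when February is one of them.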

I'll write the DAG and we can see how access to the data works if it's in a table in the data lake. Very happy to send this info along on a monthly basis, and we can just make a habit of deleting the previous one when the new one is sent.

Good point, @AndrewTavis_WMDE, monthly iterations make most sense considering the 90 day limit. Let's go with what you suggested.

@Lydia_Pintscher, could you please still confirm whether the comparison with the previous month gave you added insights, so we know whether to include it in this table in future iterations?

An MR is already open with the basic structure of the DAG (basically just the first REST API one with minor modifications). I'll finalize this once the REST API metrics one and the sitelinks one are finalized as far as exporting to the published datasets goes. The only open question on those is where I should send test data (HDFS or the stat machines).

Hi @Lydia_Pintscher, fyi, I just decided to keep the comparison with the previous month and asked Andrew to create a "new this month" column to represent it. Cheers!
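As a rough illustration of what a "new this month" column could mean, the sketch below flags user agents that appear in the current month but not in the previous one. The function name, the input shape (user agent mapped to total requests) and the sample user agents are all hypothetical; the actual job runs as a query in the data lake.

```python
def with_new_flag(
    current: dict[str, int], previous: dict[str, int]
) -> list[tuple[str, int, bool]]:
    """Return (user_agent, total_requests, new_this_month) rows.

    A user agent is "new this month" if it made no requests in the
    previous month. Rows are sorted by total requests, descending.
    """
    rows = [(ua, total, ua not in previous) for ua, total in current.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical monthly totals per user agent.
april = {"ExampleBot/1.0": 120, "OtherTool/2.3": 4}
may = {"ExampleBot/1.0": 300, "example-script/0.1": 12}

for row in with_new_flag(may, april):
    print(row)
```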

The DAG and its job are ready at this point. The merge request has been updated and can be seen here: gitlab.wikimedia.org/repos/data-engineering/airflow-dags/merge_requests/738. I don't have time for the local testing right now: I tried earlier today and the query was taking too long. This will be finished quickly once I'm back.

Manuel lowered the priority of this task from High to Medium.Jul 24 2024, 9:22 AM
karapayneWMDE renamed this task from [Analytics] Quartely/monthly User Agents using Wikidata's new REST API to [Analytics][REST API Sup M4] Quartely/monthly User Agents using Wikidata's new REST API.Aug 21 2024, 3:57 PM
karapayneWMDE renamed this task from [Analytics][REST API Sup M4] Quartely/monthly User Agents using Wikidata's new REST API to [Analytics][[QuartSup M4] Quartely/monthly User Agents using Wikidata's new REST API.
karapayneWMDE renamed this task from [Analytics][[QuartSup M4] Quartely/monthly User Agents using Wikidata's new REST API to [Analytics][QuartSup M4] Quartely/monthly User Agents using Wikidata's new REST API.Aug 21 2024, 3:59 PM

Took a quick moment to figure out what's going on with this DAG job query so that it can be included in the deployments we're doing in the coming days. Specifically, the job was not finishing with the previous setup, where we wanted user agents, their total requests, and whether they had also made a REST API request in the previous month. I optimized the query as much as I could, but it still did not finish over a monthly period. I've now edited the query so that we just get user agents and their total requests within the last month, in descending order of total requests. This query finishes in 15 minutes; it still takes that long because the volume of data ingested means that query planning cannot be optimized.

Will send along the create table statement for this so that we'll be ready for DAG testing with the rest of the ones we're working on :)
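For readers following along, the shape of the simplified query described above (user agents and their total requests for one month, in descending order) can be sketched as follows. This is only an illustration run against an in-memory SQLite database: the table name `rest_api_requests`, its columns and the sample rows are assumptions, and the real job runs in Spark SQL against the data lake.

```python
import sqlite3

# Hypothetical stand-in table; the real source table is not named in this task.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rest_api_requests (user_agent TEXT, request_date TEXT)")
conn.executemany(
    "INSERT INTO rest_api_requests VALUES (?, ?)",
    [
        ("ExampleBot/1.0", "2024-05-02"),
        ("ExampleBot/1.0", "2024-05-15"),
        ("ExampleBot/1.0", "2024-05-30"),
        ("example-script/0.1", "2024-05-10"),
        ("example-script/0.1", "2024-04-28"),
        ("OtherTool/2.3", "2024-04-03"),
    ],
)

# One month only, grouped by user agent, largest totals first.
QUERY = """
SELECT user_agent, COUNT(*) AS total_requests
FROM rest_api_requests
WHERE request_date >= :month_start AND request_date < :next_month_start
GROUP BY user_agent
ORDER BY total_requests DESC, user_agent
"""


def monthly_user_agents(conn, month_start, next_month_start):
    """Return (user_agent, total_requests) rows for the half-open date range."""
    params = {"month_start": month_start, "next_month_start": next_month_start}
    return conn.execute(QUERY, params).fetchall()


for user_agent, total in monthly_user_agents(conn, "2024-05-01", "2024-06-01"):
    print(user_agent, total)
```

Dropping the previous-month comparison means the query only has to scan a single month of data, which matches the simplification described above.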

Task is done! Closing.