
[Analytics] Analysis of REST API user agents for May 2024
Closed, Resolved, Public

Description

Purpose

REST API usage grew substantially from April to May 2024. As PMs, we would like to better understand where this growth is coming from (e.g. whether we can connect it to events that were happening during that time).

Scope

  • REST API user agents for May 2024

Desired output

  • All unique user agents from May 2024 (with total API requests)
  • All unique user agents that were active in May 2024, but not April 2024 (with total API requests)
  • As this might contain private information, please send the results to Lydia and Manuel via appropriate channels and schedule them for deletion.

Urgency

When this task should be completed by. If this task is time sensitive then please make this clear. Please also provide the date when the output will be used if there is a specific meeting or event, for example.

12.6.2024


Information below this point is filled out by the Wikidata Analytics team.

General Planning

Information is filled out by the analytics product manager.

Assignee Planning

Information is filled out by the assignee of this task.

Estimation

Estimate: Half a day
Actual: Half a day (total)

Sub Tasks

Full breakdown of the steps to complete this task:

  • All unique user agents from May 2024 (with total API requests)
  • All unique user agents that were active in May 2024, but not April 2024 (with total API requests)
  • Send results

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

Notes and Questions

Things that came up during the completion of this task, questions to be answered and follow up tasks:

  • Note

Event Timeline

Manuel removed Manuel as the assignee of this task.
Manuel updated the task description.

Quick note from our discussion: something else to check would be the user agents that were active in May 2024 but not in April 2024 :)

Quickly checking the numbers here: the request is for the top 1000 user agents by number of requests and then a sample of 1000 user agents, but there are only 1221 in total. Would an ordered list of all of them make more sense, given that 1000 would be a sample of about 82%? There really wouldn't be much difference between the first two sets. So: one ordered list of all of them, and another ordered list of those that were active in May but not in April?

Base queries for all of this are ready :) Let me know on the above and I'll finalize them. Actually running them will take some time.
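The queries themselves are not included in this thread. As a rough illustration only of what the two requested lists amount to, here is a minimal Python sketch. The `(month, user_agent)` record layout, the sample values, and the helper name `requests_per_agent` are all assumptions made for the example, not the actual data lake schema or the queries that were run.

```python
from collections import Counter

# Hypothetical sample records, one (month, user_agent) pair per API request.
# The real data lives in the data lake; this layout is assumed for illustration.
requests = [
    ("2024-04", "ExampleBot/1.0"),
    ("2024-05", "ExampleBot/1.0"),
    ("2024-05", "ExampleBot/1.0"),
    ("2024-05", "NewTool/0.3"),
    ("2024-05", "NewTool/0.3"),
    ("2024-05", "NewTool/0.3"),
    ("2024-04", "OldScript/2.1"),
]


def requests_per_agent(records, month):
    """Count requests per user agent for a single month."""
    return Counter(agent for m, agent in records if m == month)


april = requests_per_agent(requests, "2024-04")
may = requests_per_agent(requests, "2024-05")

# List 1: all user agents seen in May, ordered by request count (descending).
may_ordered = may.most_common()

# List 2: user agents seen in May but not in April, in the same order.
new_in_may = [(agent, count) for agent, count in may_ordered if agent not in april]

print(may_ordered)
print(new_in_may)
```

With only 1221 distinct user agents, both lists are small enough to export in full, which is why a complete ordered list is simpler than a top-1000 list plus a sample.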

Re how to send the files: my suggestion would be that I put them into my home directory on stat1010 and then @Manuel can copy them to his. I'll then delete my copy, and he can delete his once he and @Lydia_Pintscher are done checking them. I'm suggesting this as I can't move the files into another user's directory myself.

Generally, from one's home directory, the command would be:

# The last . is the current directory, and autocomplete should work.
cp ../andrewtavis-wmde/wikidata/2024/T366621_rest_api_user_agents/FILE_NAME.csv .

Let me know how this sounds!

Would an ordered list of all of them make more sense

Yes, definitely, @AndrewTavis_WMDE!

Re how to send the files:

Would it be possible to send us a spreadsheet (and schedule it for deletion after 90 days)?

Would it be possible to send us a spreadsheet (and schedule it for deletion after 90 days)?

I'd prefer to transfer via the servers if possible given the comment here from WMF Engineering. I'm also not sure how to schedule a spreadsheet for deletion, but can look into this if this would be preferable.

I can also prepare a notebook with quick functions to load and explore the data, if that would make the option I suggested a bit easier.

@AndrewTavis_WMDE: A file on the server seems too complicated to access for non-analytics PMs, so it would be great if you found a simpler solution. While a Google spreadsheet seemed most convenient, using e.g. WMDE's internal Wolke cloud could also be an option in case we need to avoid Google products for this.

I'm also not sure how to schedule a spreadsheet for deletion

Ah, I just meant setting up a reminder in Google calendar and deleting it manually.

@Manuel, my assumption was that you could help any non-analytics PMs or go through the results with them as you have the needed access. Using Google for PII is not something we're supposed to do if it can be avoided, but I have no experience with Wolke. Please let me know if you'd like me to look into Wolke or send the files over Drive.

Please let me know if you'd like me to look into Wolke or send the files over Drive.

Yes, that would be great, thank you. Anything relying on our normal UCS, SUL or Wikitech logins will do (it should not require Kerberos authentication).

@Manuel and @Lydia_Pintscher, just shared a folder with the two CSVs on Wolke. Let me know if there's anything else needed, and I will set a reminder that they should be deleted on my end in 89 days (they were generated yesterday). Sharing has been disabled on the directory, so if others need access, then let me know :)
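For the reminder arithmetic, a small Python sketch may help. It assumes the 90-day retention period mentioned earlier in the thread; the generation date below is a placeholder, not the real one, and the function names are made up for the example.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # retention period mentioned earlier in the thread


def deletion_date(generated_on, retention_days=RETENTION_DAYS):
    """Return the date on which the files should be deleted."""
    return generated_on + timedelta(days=retention_days)


def days_remaining(generated_on, today, retention_days=RETENTION_DAYS):
    """Return how many days are left until the deletion date."""
    return (deletion_date(generated_on, retention_days) - today).days


# Placeholder date for illustration only; not the actual generation date.
generated_on = date(2024, 1, 1)
shared_on = generated_on + timedelta(days=1)  # files shared one day later

print(deletion_date(generated_on))
print(days_remaining(generated_on, shared_on))  # 89
```

Counting from the generation date rather than the sharing date is what gives 89 days for files shared one day after they were generated.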

Perfect, thx! Resolving this now.

@Lydia_Pintscher: To exclude browsers, you can filter out every user agent that contains "Mozilla".
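A minimal sketch of that filter in Python, assuming the results have been loaded into a mapping from user agent string to request count. The sample strings, counts, and the function name `exclude_browsers` are invented for illustration.

```python
def exclude_browsers(agent_counts, marker="Mozilla"):
    """Drop every user agent whose string contains the marker (case-sensitive)."""
    return {
        agent: count
        for agent, count in agent_counts.items()
        if marker not in agent
    }


# Hypothetical user agents and request counts, for illustration only.
agent_counts = {
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0": 120,
    "ExampleBot/1.0 (https://example.org/bot)": 4500,
    "python-requests/2.31.0": 800,
}

non_browser = exclude_browsers(agent_counts)
print(sorted(non_browser))
```

This is only a heuristic: some crawlers and tools also send user agent strings containing "Mozilla", so they would be filtered out together with real browsers.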