Problem:
As Wikidata PMs we want to understand the usage of Wikidata's new REST API better to make informed product decisions.
Questions:
- How many clients make use of Wikidata's new REST API?
- How many accesses does each of them make?
- How many users use each of the clients?
How the data will be used:
- Enable quarterly reporting of "Developers begin accessing Wikidata’s data via the new REST API (as measured in approximation based on the number of unique User-Agent HTTP headers used for requests per month)". The plan is to report this starting Q2 2023.
- Help identify more refined and meaningful metrics in the future that PMs will continuously monitor to understand Wikidata.
Notes:
- We will measure the "number of clients per month" in approximation based on the number of unique User-Agent HTTP headers used for the requests (see the example queries below).
- You will find the base data in the wmf.webrequest table. This table is continuously purged of data older than 90 days. So we will need to establish a data pipeline that regularly aggregates the data.
- The table is huge and hard to work with. You can access hourly chunks via Superset (after signing the NDA), but to do meaningful work with this data you will need Hive/Hadoop access and will have to use our more powerful analytics systems.
- The data can be accessed via SQL Lab in Superset using the presto_analytics_hive database (the server is however not powerful enough to cope with more than one day at a time):
SELECT COUNT(*) AS hits, user_agent, http_method
FROM wmf.webrequest
WHERE year = 2023 AND month = 03 AND day = 22
  AND uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/w/rest.php/wikibase/%'
GROUP BY user_agent, http_method
ORDER BY COUNT(*) DESC
LIMIT 1000
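- For the quarterly KPI itself, a minimal sketch of a monthly count, assuming the same host/path filters as the example above and no further filtering (e.g. of bot-generated user agents); COUNT(DISTINCT user_agent) is the "number of clients" approximation described above:
-- Approximate number of clients for one month, counted as unique User-Agent headers.
-- The month value is just an example; any bot filtering would still need to be added.
SELECT COUNT(DISTINCT user_agent) AS approx_clients
FROM wmf.webrequest
WHERE year = 2023 AND month = 6
  AND uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/w/rest.php/wikibase/%'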
Assignee Planning
Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.
Sub Tasks
Full breakdown of the steps to complete this task:
- Define all database tables and columns needed and add them to the ticket
- Define the output and check with stakeholders (@Manuel)
- Table with the following columns
- Year, month, day, user_agent, total_requests, total_unique_ips
- All data including the user agents must stay private, but we can e.g. publish the dataset using pseudonyms.
- unique_ips would be counted on both a daily and a monthly basis (see the aggregation sketch after this list)
- Baseline aggregation in a notebook
- Explore how to complete a data aggregation process
- Where to store the query and output data?
- Queries within the analytics repo on GitLab? (yes - I just requested access to GitLab)
- Where would such tables that we want to create live within the data lake?
- They can make us a new database (we just need to provide a name)
- When to run it
- On the morning of the day after the period (day, month, quarter) ends?
- Right after the run of the last flow for the period
- We can use tags like @daily or @monthly that run them after the period ends
- Relation to WMF flows within Airflow
- I'm assuming that there are .xml/.yaml files that define the flows? (there is a file defining most of our datasets)
- This would be run after the requests flow, but what else would precede it in the flows, just for our overall understanding of its position within the general graph?
- How to test it prior to deployment
- Specifically, is there a test server where new flows are run against the graph?
- Instructions for setting up a development env
- How will we be able to access it?
- We'd like to use Grafana, so output to Graphite/Prometheus?
- Or how easy is it to import this data into the WM Cloud Superset?
- The suggestion is to use Superset; further testing of the WM Cloud instance is needed to see whether the data import will work effectively
- How will errors be reported/handled if something goes wrong
- Email notification, or can we set it up such that we get an alert within their flows notifications channel on Slack? Potentially set up a group like wmde-analytics?
- We'd prefer to be able to mute the full alerts channel and just be notified when something that belongs to us has failed.
- There are email alerts for Airflow
- Can we backdate the information for April, May and June of this year so that it's available for a quarterly aggregate?
- Yes, let them know what to set the start_date to once we're hitting production
- Basic investigations of columns in the data and what information they could provide if we subset based on them
- Writing the aggregation queries
- Try to extrapolate the numbers for April (based on the data that we have)
- Monthly totals
- June 2023: 13213
- June 2023 (filtering out user_agent values that match bot generated versions): 126
- May 2023: 141
- May 2023 (filtering out user_agent values that match bot generated versions): 122
- April 2023 (from the 11th): 87
- Basic math
- 87 / ((30-11) / 30) ≈ 137 (similar to May)
- Difference between May with and without the bot user agents filtered out is 141 - 122 = 19
- 137 - 19 = 118, so we could say that we're looking at roughly 118 for April 2023 with filtering, which suggests slight but consistent growth
- Also get the trailing 30-day values as of the last day of each Q2 month (in preparation for potentially switching to a rolling 30-day window later)
- Last 30 day totals
- June 2023: 13213 (the same, as June has 30 days)
- June 2023 (filtering out user_agent values that match bot generated versions): 126 (the same, as June has 30 days)
- May 2023: 140 (different, as May has 31 days)
- May 2023 (filtering out user_agent values that match bot generated versions): 121 (different, as May has 31 days)
- Document all the numbers for Q2 in this task
- Let's discuss what we can do to make the metric more robust and reliable (e.g. exclude browser user agents)
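A minimal sketch of the daily aggregation discussed in the sub tasks, assuming the proposed output columns (year, month, day, user_agent, total_requests, total_unique_ips); the date values are placeholders, and the sha256-based pseudonym column only illustrates the "publish under pseudonyms" idea, not an agreed-on anonymization method:
-- Daily aggregation per user agent for Wikibase REST API traffic.
-- The pseudonym column is illustrative only; raw User-Agent strings stay private.
SELECT
  year,
  month,
  day,
  user_agent,
  to_hex(sha256(to_utf8(user_agent))) AS user_agent_pseudonym,  -- illustrative pseudonym
  COUNT(*) AS total_requests,
  COUNT(DISTINCT client_ip) AS total_unique_ips
FROM wmf.webrequest
WHERE year = 2023 AND month = 6 AND day = 30  -- placeholder date
  AND uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/w/rest.php/wikibase/%'
GROUP BY year, month, day, user_agent
ORDER BY total_requests DESC
The monthly flow would be the same query without the day column and day filter, so that client_ip is deduplicated across the whole month.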
Data to be used
See Analytics/Data_Lake for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
- wmf.webrequest as a basis for an eventual aggregation table
- Columns used will be: year, month, day (potentially), user_agent, client_ip
- Columns used in WHERE clause:
- uri_host = 'www.wikidata.org'
- uri_path LIKE '/w/rest.php/wikibase/%'
- Columns maybe used: agent_type, http_method (for investigation)
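A small sketch of the kind of investigation mentioned for agent_type and http_method, assuming the same host/path filters as above; the date values are placeholders:
-- How requests and unique user agents split across agent_type and HTTP method.
SELECT
  agent_type,
  http_method,
  COUNT(*) AS hits,
  COUNT(DISTINCT user_agent) AS unique_user_agents
FROM wmf.webrequest
WHERE year = 2023 AND month = 6 AND day = 30  -- placeholder date
  AND uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/w/rest.php/wikibase/%'
GROUP BY agent_type, http_method
ORDER BY hits DESC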
Notes and Questions
Things that came up during the completion of this task, questions to be answered and follow up tasks:
- Q: Which column defines the REST API endpoints within the wmf.webrequest table?
- A: uri_path
- Q: Do we want to add more details (e.g. split the hits by endpoint)?
- A: With endpoints defined by uri_path we're getting entries split by individual QIDs, which isn't necessary
- A: We could split the paths by HTTP method (GET, POST, etc.) and by resource (item descriptions, properties, etc.), which could then be used later (see the path-normalization sketch at the end of this section) :)
- A: not necessary right now, but keep it in mind
- Q: Do we need more detail than months or not?
- A: I'd say it makes sense to include days.
- Q: Where/how do we best store the aggregated data?
- A: We want the data in the data lake. Ideally, we would also have that data available in Grafana. So this would be a good moment to see how we can push it to Graphite.
- A: Add a flow/flows to Airflow that are run at the end of each day, month and quarter
- Quarterly aggregations are difficult, as a quarter can span more than the 90 days of raw data that are kept, but we could just do the last 90 days
- Q: Do we need to do separate aggregation queries as the same client_ip would need to be accounted for over days, months and quarters for a given user_agent?
- A: We should do separate SQL flows for days and months (so that client_ip is deduplicated per period), but not for quarters
- Q: What's the method used to anonymize the data like user_agent values?
- A: Potentially look into generic descriptions like "Python request".
- A: For now only data without user_agent would be published as it could include PII.
- Q: Do we need to do separate aggregation queries for users and spiders?
- A: This might be a bit extreme, but would save us this data moving forward.
- Not at this time, but something to look into.
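As a follow-up to the endpoint-split question above, a sketch of how hits per endpoint could be counted without splitting by individual QIDs, assuming Presto's regexp_replace; the ID patterns and the placeholder text are illustrative, not an agreed-on scheme:
-- Collapse entity IDs in the path so that requests for different items/properties
-- count towards the same endpoint; grouped by HTTP method as suggested above.
SELECT
  http_method,
  regexp_replace(uri_path, '/(Q|P|L)[0-9]+', '/{entity_id}') AS endpoint,
  COUNT(*) AS hits
FROM wmf.webrequest
WHERE year = 2023 AND month = 6 AND day = 30  -- placeholder date
  AND uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/w/rest.php/wikibase/%'
GROUP BY 1, 2
ORDER BY hits DESC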