[Analytics] Unique user-agents accessing Wikidata's REST API for Q2/2023
Closed, Resolved · Public

Description

Problem:
As Wikidata PMs we want to understand the usage of Wikidata's new REST API better to make informed product decisions.

Questions:

  • How many clients make use of Wikidata's new REST API?
  • How many accesses does each of them make?
  • How many users use each of the clients?

How the data will be used:

Notes:

  • We will measure the "number of clients per month" approximately, based on the number of unique User-Agent HTTP headers used for the requests.
  • You will find the base data in the wmf.webrequest table. This table is continuously purged of data older than 90 days. So we will need to establish a data pipeline that regularly aggregates the data.
  • The table is huge and hard to work with. You can access hourly chunks via Superset (after signing the NDA), but to do meaningful work with this data you will need Hive/Hadoop access and our more powerful analytics systems.
  • The data can be accessed via SQL Lab in Superset using the presto_analytics_hive database (the server is however not powerful enough to cope with more than one day at a time):
SELECT
  COUNT(*) AS hits, user_agent, http_method 
FROM
  wmf.webrequest
WHERE 
    year = 2023
    AND month = 03
    AND day = 22
    AND uri_host = 'www.wikidata.org'
    AND uri_path LIKE '/w/rest.php/wikibase/%'
GROUP BY user_agent, http_method 
ORDER BY COUNT(*) DESC
LIMIT 1000

Assignee Planning

Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.

Sub Tasks

Full breakdown of the steps to complete this task:

  • Define all database tables and columns needed and add them to the ticket
  • Define the output and check with stakeholders (@Manuel)
    • Table with the following columns
      • Year, month, day, user_agent, total_requests, total_unique_ips
      • All data including the user agents must stay private, but we can e.g. publish the dataset using pseudonyms.
      • unique_ips would be on a daily and monthly basis
    • Baseline aggregation in a notebook
  • Explore how to complete a data aggregation process
    • Where to store the query and output data?
      • Queries within the analytics repo on GitLab? (yes - I just requested access to GitLab)
      • Where would such tables that we want to create live within the data lake?
      • They can make us a new database (we just need to provide a name)
    • When to run it
      • In the morning the next day after the period ends? (day, month, quarter)
      • Right after the run of the last flow for the period
      • We can use tags like @daily or @monthly that run them after the period ends
    • Relation to WMF flows within Airflow
      • I'm assuming that there are .xml/.yaml files that define the flows? (file defining most of our datasets)
      • This would be run after the requests flow, but what else would precede it in the flows, just for our overall understanding of its position within the general graph?
    • How to test it prior to deployment
    • How will we be able to access it?
      • We'd like to use Grafana, so output to Graphite/Prometheus?
      • Or how easy is it to import this data into the WM Cloud Superset?
      • Suggestion is to use Superset, with further testing for the WM Cloud one being necessary to see if data import will work effectively
    • How will errors be reported/handled if something goes wrong
      • Email notification, or can we set it up such that we get an alert within their flows notifications channel on Slack? Potentially set up a group like wmde-analytics?
      • We'd prefer to be able to mute the full alerts channel and just be notified when something that belongs to us has failed.
      • There are email alerts for Airflow
    • Can we backdate the information for April, May and June of this year so that it's available for a quarterly aggregate?
      • Yes, let them know what to set the start_date to once we're hitting production
  • Basic investigations of columns in the data and what information they could provide if we subset based on them
  • Writing the aggregation queries
  • Try to extrapolate the numbers for April (based on the data that we have)
    • Monthly totals
      • June 2023: 13213
      • June 2023 (filtering out user_agent values that match bot generated versions): 126
      • May 2023: 141
      • May 2023 (filtering out user_agent values that match bot generated versions): 122
      • April 2023 (from the 11th): 87
        • Basic math
          • 87 / ((30-11) / 30) ≈ 137 (similar to May)
          • Difference between May totals with and without filtering is 141 - 122 = 19
          • 137 - 19 = 118, so we could say that we're looking at roughly 118 for April 2023 with filtering, which further gives us a slight "consistent growth"
  • Also get the 30-day values as of the last day of each Q2 month (in preparation for potentially switching to rolling 30-day values later)
    • Last 30 day totals
      • June 2023: 13213 (the same, as June has 30 days)
      • June 2023 (filtering out user_agent values that match bot generated versions): 126 (the same, as June has 30 days)
      • May 2023: 140 (different, as May has 31 days)
      • May 2023 (filtering out user_agent values that match bot generated versions): 121 (different, as May has 31 days)
  • Document all the numbers for Q2 in this task
  • Let's discuss what we can do to make the metric more robust and reliable (e.g. exclude browser user agents)

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

  • wmf.webrequest as a basis for an eventual aggregation table
  • Columns used will be: year, month, day (potentially), user_agent, client_ip
  • Columns used in WHERE clause:
    • uri_host = 'www.wikidata.org'
    • uri_path LIKE '/w/rest.php/wikibase/%'
  • Columns maybe used: agent_type, http_method (for investigation)
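
For illustration, a minimal sketch of the daily aggregation over these columns (the output database and table names were still to be agreed with WMF, so the destination is only indicated in a comment, and in production the dates would be parameterized by Airflow):

-- Daily aggregation sketch: requests and unique IPs per user agent.
-- In production the result would be inserted into a table in the new
-- WMDE database (name still to be decided).
SELECT
  year,
  month,
  day,
  user_agent,
  COUNT(*)                  AS total_requests,
  COUNT(DISTINCT client_ip) AS total_unique_ips
FROM
  wmf.webrequest
WHERE
  year = 2023
  AND month = 6
  AND day = 30
  AND uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/w/rest.php/wikibase/%'
GROUP BY year, month, day, user_agent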

Notes and Questions

Things that came up during the completion of this task, questions to be answered and follow up tasks:

  • Q: Which column defines the REST API endpoints within the wmf.webrequest table?
    • A: uri_path
  • Q: Do we want to add more details (e.g. split the hits by endpoint)?
    • A: With endpoints defined by uri_path we're getting entries split by individual QIDs, which isn't necessary
    • A: We could split the paths based on information like GET and POST as an HTTP method on item descriptions, properties, etc. that could then be used later :)
    • A: Not necessary right now, but keep it in mind
  • Q: Do we need more detail than months or not?
    • A: I'd say it makes sense to include days.
  • Q: Where/how do we best store the aggregated data?
    • A: We want the data in the data lake. Ideally, we would also have that data available in Grafana. So this would be a good moment to see how we can push it to Graphite.
    • A: Adding a flow/flows to Airflow that are run at the end of a day, month and quarter
      • Quarter aggregations are difficult as they could span more than 90 days, but we could just do the last 90 days
  • Q: Do we need to do separate aggregation queries, as the same client_ip would need to be accounted for over days, months and quarters for a given user_agent?
    • A: We should be doing separate SQL flows for days and months, but not for quarters, for client_ip
  • Q: What's the method used to anonymize data like user_agent values?
    • A: Potentially look into generic descriptions like "Python requests".
    • A: For now only data without user_agent would be published, as it could include PII.
  • Q: Do we need to do separate aggregation queries for users and spiders?
    • A: This might be a bit extreme, but would save us this data moving forward.
    • A: Not at this time, but something to look into.

Event Timeline


FYI @Manuel, this is the breakdown of agent_type for yesterday's REST API requests:

agent_type | total_requests | percent_of_total
spider     | 11788          | 96.6
user       | 418            | 3.4

And the same for http_status is:

http_status | total_requests | percent_of_total
200         | 12036          | 98.6
304         | 144            | 1.2
404         | 16             | 0.1
400         | 6              | 0
403         | 3              | 0
308         | 1              | 0
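
For reference, breakdowns like the two above can be produced with a window function for the percentage column. A sketch based on the example query in the description (the date values are illustrative; swapping agent_type for http_status gives the second table):

SELECT
  agent_type,
  COUNT(*) AS total_requests,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS percent_of_total
FROM
  wmf.webrequest
WHERE
  year = 2023
  AND month = 6
  AND day = 22  -- illustrative date; in practice, yesterday's date
  AND uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/w/rest.php/wikibase/%'
GROUP BY agent_type
ORDER BY total_requests DESC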

I've written very detailed pseudocode for all the potential table creation scripts. Just need to figure out the insertion step after we get more information from WMF and then the queries should be done :)

  • Where to store the query and output data?

Most of our users store their data in HDFS, as part of a Hive table.

  • Queries within the analytics repo on GitLab?

Yes that is what we recommend these days. You will have your GitLab repo, and perhaps inside of it a sql folder with all your queries. Then, from Airflow, you can trigger a particular SQL command via SparkSqlOperator. See docs here, but if you have not used Airflow before, we can have a meeting to get you on your way.

  • Where would such tables that we want to create live within the data lake?

We can create a new database for you, and you can start adding tables there as needed. Let us know a database name that would make sense for you.

  • When to run it
    • In the morning the next day after the period ends? (day, month, quarter)

This is easily set on your Airflow DAGs as there are schedule tags like @daily or @monthly, and yes, they run after the period ends.

  • Relation to WMF flows within Airflow
    • I'm assuming that there are .xml/.yaml files that define the flows?
    • This would be run after the requests flow, but what else would precede it in the flows, just for our overall understanding of its position within the general graph?

We do have a file defining most of our datasets in Airflow. But I think all you need to care about is whether the webrequest data is available for a specific time period. We have built tooling making this a one-liner. That line gets you an Airflow sensor that knows the details on how to wait for webrequest data. This is probably confusing if you have not done Airflow before, so if that is the case, let's set up an intro call.

  • How to test it prior to deployment
    • Specifically, is there a test server where new flows are run against the graph?

We have instructions on how to launch your own development environment on stat machines.

  • How will we be able to access it?
    • We'd like to use Grafana, so output to Graphite/Prometheus?
      • Or how easy is it to import this data into the WM Cloud Superset?

From @Ottomata via Slack:
re grafana/prometheus: this is difficult for aggregate data, because prometheus does not allow you to produce data with historical times.
best viz tool we use is: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Superset

  • How will errors be reported/handled if something goes wrong
    • Email notification, or can we set it up such that we get an alert within their flows notifications channel on Slack? Potentially set up a group like wmde-analytics?
    • We'd prefer to be able to mute the full alerts channel and just be notified when something that belongs to us has failed.

You can set an email or email array per Airflow DAG. Regarding Slack, we currently don't have an integration like that, but if you want to work on it, definitely go for it!

  • Can we backdate the information for April, May and June of this year so that it's available for a quarterly aggregate?

Yes, you can do a 'backfill' of data. Whenever we are ready to go to production, we would set the start_date of your DAG to whatever date makes sense to you and it will auto backfill.

We have office hours if you want to ask more questions. Here is a Google Calendar link to the next office hours.

Also, we should figure out whether we can set you up on the platform_eng Airflow instance, or if we should launch a new one. CC @lbowmaker.

@xcollazo and @Ottomata: Thank you for your input! An introduction to Airflow for @AndrewTavis_WMDE and me would be fantastic! I'll contact you on Slack to talk more about it.

Thanks for the detailed explanations, @xcollazo! Really is such a wonderful wealth of information :) We have a call with you all to discuss Airflow later today that will also help explain some more points as far as really getting it all set up. Can't wait to get into the flow of all this!

Note for analytics on our end: further useful exploration of webrequests based on a conversation with @Manuel this morning would be:

  • Breaking down agent_type also by user_agent and share with @Manuel privately
  • The example queries for aggregation tables should be segmented based on http_status, specifically into successful and unsuccessful requests (we're not going to include this in these workflows, but queries to subset based on that should be written)

@Manuel, here's the breakdown of http_status for requests yesterday just to show that I have the segmentation for that set up now. Am just adding WHERE http_status LIKE '2%' to a CASE WHEN for this.

http_status_ok | total_requests | percent_of_total
True           | 7814           | 97.4
False          | 205            | 2.6

I'll send along a link to a Google Sheet with the breakdown for agent_type and user_agent soon. We can discuss from there if you'd like a larger dataset generated for that :)

Notes from the WMF-WMDE meeting on Airflow with @mforns:

  • data-engineering/airflow-dags houses all the DAGS for scheduling
    • Data transformation doesn't happen here
    • We'll need our own folder for WMDE Analytics
  • Each repo has a config and dags
    • dag config will be written by data engineering
    • artifacts.yaml may not be necessary as of now (for Java jars and other tools)
    • datasets.yaml will be necessary
  • *_dag.py files define DAGs
    • DagProperties allows for properties to be overwritten at runtime (also for testing)
    • We'll use this for setting the start date and alert emails, etc
    • SLAs (service level agreements) can be defined for notifications, etc
    • URLs are raw files within GitLab or another repo
      • Consider using a versioned copy of the queries, since if we update the query Airflow will run the DAG with the new version
      • Properly testing the workflows should mean that we can deploy from main though
    • Airflow macros are used to define the times for DAG runs (year, month, day, etc)
  • SSH tunnel for Airflow UI
    • Development instance of Airflow can be spun up and should be run per new flow being added
    • Clone the script in a stats machine and call it
    • We're system users and thus need to create a home env (e.g. a /tmp folder)
    • DagProperties automates the creation of airflow variables that we can then edit/override for testing
    • Override the variables that determine where the data processes export, etc
    • Write to our separate user instance of the database
      • If we don't have a personal database yet, then we should make one
    • System users are read-only for prod data and can only write to our own databases (or wherever we have access)

Note that I've made T338785 for my GitLab access :)

Thx, @AndrewTavis_WMDE!

Am just adding WHERE http_status LIKE '2%' to a CASE WHEN for this.

Broadly speaking 2xx and 3xx would be successful, and 4xx or 5xx would not be successful. https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

I am sure that we can find some Wikimedia-specific example code.
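
Not Wikimedia-specific, but as a sketch of that classification, keeping the string comparison style of the earlier LIKE '2%' filter and treating 2xx/3xx as successful:

SELECT
  CASE
    WHEN http_status LIKE '2%' OR http_status LIKE '3%' THEN TRUE
    ELSE FALSE
  END AS http_status_ok,
  COUNT(*) AS total_requests
FROM
  wmf.webrequest
WHERE
  year = 2023
  AND month = 6
  AND day = 22  -- illustrative date
  AND uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/w/rest.php/wikibase/%'
GROUP BY 1  -- group by the CASE expression (first selected column)
ORDER BY total_requests DESC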

@mforns, we'd discussed creating a directory for WMDE Analytics in the GitLab repo for Airflow DAGs :) @Manuel, should we finalize the name on this and get it created? Product analytics at WMF appears to be in analytics_product, so we could do analytics_wmde or wmde_analytics if we want a bit more "separation" of our work within local copies of the repo?

@AndrewTavis_WMDE Hi! I think you could go with simply wmde. The analytics prefix in analytics_product exists because the team is named like that (Product Analytics). In your case, you could use wmde I think.
BTW this is the task to create the WMDE Airflow instance: T340648

We should avoid team names in functional code / namespacing. Team names change often.

However, wmde is more like 'wmf' in this case, and I think not likely to change. +1 to wmde. :)

Thanks both for the quick responses! I'll let @Manuel make the final call, but makes sense on my end :)

Sounds good to me, thx!

The wmde name goes way beyond the scope of our team (or even the software department). But if I understand correctly, we could still organize this further using subfolders (e.g. going by products). If this assumption is right, I would suggest that we start with wmde but place our teams Airflow DAGs only in a subfolder that we will definitely still own in the future (e.g. wmde/wikidata/dags). That way the top level folder structure still remains simple and we are prepared for potential future developments.

^ sounds good!

wikidata could be an okay name too. I like functional groupings. However, that might overlap with some dags that the search platform team does for wikidata in their search instance. ¯\_(ツ)_/¯

As the current structure of the repo has so far always been the following:

data-engineering/airflow-dags/
├── team_folder/
│   ├── config/
│   │   ├── config_file
│   │   └── ...
│   └── dags/
│       ├── sub_project/
│       │   ├── sub_project_dag.py
│       │   └── ...
│       └── ...
└── ...

I'd suggest that we stick with wmde in this case and we can organize ourselves within the wmde/dags directory as we progress :)

Yes, this makes sense to me as well. Let's go with only 'wmde' and we can add a wikidata subdirectory ourselves (see also task about setting up our Airflow instance T340648).

Manuel renamed this task from [Analytics] Unique user-agents accessing Wikidata's REST API to [Analytics] Unique user-agents accessing Wikidata's REST API for Q2/2023. Jul 6 2023, 1:23 PM
Manuel renamed this task from [Analytics] Unique user-agents accessing Wikidata's REST API for Q2/2023 to [Analytics] Airflow implementation of unique user-agents accessing Wikidata's REST API. Jul 7 2023, 8:47 AM
Manuel renamed this task from [Analytics] Airflow implementation of unique user-agents accessing Wikidata's REST API to [Analytics] Unique user-agents accessing Wikidata's REST API for Q2/2023.
Manuel updated the task description.
Manuel triaged this task as High priority. Jul 10 2023, 8:37 AM

Checks that we could do to fix issues with users creating multiple user_agent values would be:

  • Switch this metric to counting unique ip values
  • Count unique user_agent values, but only if they have a unique ip
    • If they have a non-unique ip, then they would only count as one user_agent

@Manuel, something else to explore would be to see if we could figure out a metric that links user_agent and ip. I'm a bit confused why we'd go through this and then still have tons of unique individuals lumped into the Python requests count. A breakdown:

  • User has unique user_agent and ip
    • All's fine - select the user_agent/ip pair
  • User has unique user_agent and multiple ip values
    • We're assuming that the ip value was changed by their provider and select one user_agent/ip pair
  • User has a unique ip but is using multiple user_agent values
    • We select one of the user_agent/ip pairs assuming that it's unique and that the others are generated repeats

This basically boils down to:

  • A selection of user_agent/ip pairs where each value can only occur once within the pairs

For the above we'd get a few extra values from those users who are generating user agents and also getting their ip switched, but this might be better than dramatically undershooting our total user base? Yes it changes the metric, but the goal is an estimation of the usage that the Python requests user_agent is masking.

A selection of user_agent/ip pairs where each value can only occur once within the pairs

I don't think so: It's only the IP value that should only occur once. So a more robust metric than the original would likely not rely on user agents at all (e.g. unique IPs).

But in addition to a new metric, we also want to keep reporting the original metric (so that we do not have to officially change the metric through all levels of the org hierarchy). For this we will need to consider the following:

unique user_agent and unique ip

All is fine.

unique user_agent but multiple ip values

This will primarily be because of (multiple different) users using only default values for their user agent. Also, as you wrote, the same users' IPs changed by their provider. In the logic of the original metric we need to consider this as only one user_agent (likely underestimating what we are actually interested in).

unique ip but multiple user_agent values

If this happens too often (as we saw in the data recently) then this makes the original metric meaningless. So if we wanted to continue to report something in the logic of the original metric, we would need to filter the additional user_agents out (and consider only one user_agent per IP).

And this boils down to user_agent/ip pairs where each value can only occur once in the whole time period considered (e.g. last month/30d).

Does this help?

I'm still a bit confused. The above comment starts with the following:

A selection of user_agent/ip pairs where each value can only occur once within the pairs

I don't think so: It's only the IP value that should only occur once. So a more robust metric than the original would likely not rely on user agents at all (e.g. unique IPs).

... and ends with this:

And this boils down to user_agent/ip pairs where each value can only occur once in the whole time period considered (e.g. last month/30d).

You're more saying that each individual user_agent or ip can only occur once, while I'm more saying that any combination cannot include both a user_agent and an ip that aren't unique. To clarify where I think this leaves us though: at the end of this we want user_agents, but we just want them to be unique based on ip values?

unique user_agent but multiple ip values

This will primarily be because of (multiple different) users using only default values for their user agent. Also, as you wrote, the same users' IPs changed by their provider. In the logic of the original metric we need to consider this as only one user_agent (likely underestimating what we are actually interested in).

Specifically I'm thinking about the first clause in your comment above where we have everyone lumped into Python requests. Is the purpose of this metric to define unique users of the API? If so, then dramatically undershooting it when we have a way of correcting it via pairing the values is a bit confusing. The big question for me is what are we trying to say in the story here? If at the end we're reporting unique users and we have the caveat that it's undershooting by thousands, then I think that for now it's ok to provide this information, but we really should consider reworking this metric. user_agent undershoots it by a lot and ip overshoots it by a lot, so neither really answers this story well.

Could be something that I could discuss with WMF though 🙃

In talking a bit with folks at the Data Platform Engineering Office Hours the general sentiment was that WMF is currently hashing user_agent and ip for "unique" users, but that there's work to be done here. There's a fair amount of entropy in this method, but then it does generally provide a better metric for reporting than either user_agent or ip alone. References on this:

The usage of (quote unquote) "unique" above comes from the fact that this is an imperfect process that is only getting more imperfect as the internet changes. Chrome for instance is apparently making changes to their user agents whereby there are more merges of users with the above methods, but they don't appear to have immediate plans to switch to anything else for reporting in the future. That user_agent appears to be getting only more problematic for the most popular browser would further point to a bit of a rethink on this. Continuing with the planned reporting is of course very reasonable though! 😊

Is the purpose of this metric to define unique users of the API?

The purpose of the metric is to approximate unique developers that are using the new API, yes.

If at the end we're reporting unique users and we have the caveat that it's undershooting by thousands, then I think that for now it's ok to provide this information, but we really should consider reworking this metric. user_agent undershoots it by a lot and ip overshoots it by a lot, so neither really answers this story well.

This is exactly why I wanted us to discuss potential alternatives to the original metric. I am not certain by how much IP overshoots (maybe by a lot, but maybe it's not that bad). We can also have more than one metric. There are also other aspects that are relevant, e.g. how much use there is per user.

Could be something that I could discuss with WMF though 🙃

Sure, they might have best practices about robustly measuring the unique developers that are using their public APIs, this would help us for sure.

In talking a bit with folks at the Data Platform Engineering Office Hours the general sentiment was that WMF is currently hashing user_agent and ip for "unique" users, but that there's work to be done here.

A hash of User Agent and IP would make no sense in our case, unfortunately. This would basically just result in a max overshoot.

So it seems that the best we have so far is unique IPs. It is not ideal, as it will undershoot for e.g. scripts that run on cloud infrastructure, and overshoot for e.g. scripts that run in the browser of end users or with dynamically assigned IPs, but it's a solid candidate to accompany the original unique user_agent metric.

For the original metric we need to filter out all user agents that follow the problematic pattern (e.g. by counting unique IPs instead of user agents whenever we see the problematic pattern).
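
A sketch of that logic in SQL (Spark SQL flavour; the bot-generated pattern itself is kept private, so the regex below is only a placeholder):

-- Non-problematic user agents are counted as themselves; user agents matching
-- the problematic (bot-generated) pattern are counted by unique IP instead.
SELECT
  COUNT(DISTINCT IF(user_agent RLIKE '<problematic pattern>', NULL, user_agent))
    + COUNT(DISTINCT IF(user_agent RLIKE '<problematic pattern>', client_ip, NULL))
    AS approx_unique_clients
FROM
  wmf.webrequest
WHERE
  year = 2023
  AND month = 6  -- illustrative period
  AND uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/w/rest.php/wikibase/%'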

For the initial report where we just want to count certain user_agent values as one via their ip:

  • Match it with regular expressions within the SQL as the initial try
  • Check Python packages that help with user_agent values
    • python-user-agents is for identifying devices through their user_agent data
    • user_agent generates random user_agent data (potentially what those users are using)
    • uap-python is for parsing user_agent data
  • The date selection function I wrote needs to be changed to use explicit exclusion at the end of the time period (see the sketch below)
  • More documentation is needed for the function that builds the date WHERE clause
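
A sketch of the explicit end-of-period exclusion mentioned above (Spark SQL flavour, where make_date is available; in practice the year/month/day partition columns would also be constrained directly so partition pruning still applies):

-- June 2023 as a half-open interval: the start date is included and the first
-- day after the period is excluded, so adjacent periods never overlap.
SELECT
  user_agent,
  COUNT(*) AS total_requests
FROM
  wmf.webrequest
WHERE
  uri_host = 'www.wikidata.org'
  AND uri_path LIKE '/w/rest.php/wikibase/%'
  AND make_date(year, month, day) >= DATE '2023-06-01'
  AND make_date(year, month, day) <  DATE '2023-07-01'
GROUP BY user_agent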

@Manuel, the task has now been updated with all the metrics. Took a bit for all the queries to run :) Let me know if you have further thoughts on the final subtask:

  • Let's discuss what we can do to make the metric more robust and reliable (e.g. exclude browser user agents)

@AndrewTavis_WMDE: Thx for the update! We last aligned on completely removing all user agents that fit the malicious pattern. Are 126 for June and 122 for May the result of this? I am asking because the task's description says "filtering out ips with user_agent values that match bot generated versions".

Hey @Manuel 👋 126 and 122 are a result of what it is that we aligned on, yes :) I just didn't update the description properly. Will do so now 😊

It's now the following:

filtering out user_agent values that match bot generated versions

Let me know if you'd like me to post the query here as well :)

Thank you!

About T341330:

  • We last aligned on creating rolling 30-day values only for unique IPs and not for user agents (as they are not robust enough for continuous monitoring).
  • That means that we will need the unique user agents only for the 2023 reporting (per month, filtered plus potentially manual data cleaning).

Could you modify the subtask accordingly?

Sure, I'll head over to T341330 and edit the task so it's aligned with what we discussed :)

Are there still open questions that need my input or are we done and I only need to sign off (and close) the task?

  • Let's discuss what we can do to make the metric more robust and reliable (e.g. exclude browser user agents)

The above was still to be done, but I'd say it's finished and this is good to close :)

The above was still to be done

Ah, I see! Yes, I agree, we have talked about this and we are trying out rolling 30-day unique IPs as a part of the Airflow version.

So we are done here.^^