
Allow recording Wikibase REST API usage for Wikidata monitoring and metrics
Closed, Resolved · Public


As a Wikidata Product Manager, I want to know how the Wikibase REST API is used on Wikidata so that I can reason about its adoption and usage trends.

WMDE's Wikidata Team maintains a number of dashboards where they monitor the usage of APIs on Wikidata (currently the action API). They intend to also include reads and edits done using the Wikibase REST API there.

According to @Addshore's elaboration in April 2022 (via email, summarized below), the following dashboards are likely to be involved.

The relevant graphs on those dashboards seem to rely on data reported to the following WMF Graphite metrics:

  • (not Wikibase specific, provided by the Mediawiki framework)
  • wikibase.repo.api.getentities.entities (code) -- not relevant for the Wikibase REST API currently, as there is no way to request data for more than one item per request.

Based on a cursory search, the Mediawiki REST API does not currently record timing/usage of REST API requests at the framework level.
That likely means that, in order to fulfill Wikidata's requirements, the Wikibase REST API would either need to allow optional metric tracking within its own scope, or such an option would need to be added at the Mediawiki REST API framework level. The latter would likely happen in coordination with a currently unnamed WMF counterpart team.
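To make the "optional metric tracking" idea concrete, here is a minimal sketch of per-request metric emission using the plain statsd wire format over UDP. The metric names, host, and port are hypothetical illustrations; in practice this would go through Mediawiki's own metrics infrastructure rather than a hand-rolled socket.

```python
# Sketch of optional per-request metric emission in the statsd wire format.
# Metric names and the statsd host/port are assumptions for illustration only.
import socket

STATSD_HOST = "localhost"  # assumption: a local statsd daemon
STATSD_PORT = 8125


def format_timing_metric(name: str, millis: float) -> str:
    """Render one timing sample in the plain statsd wire format."""
    return f"{name}:{int(millis)}|ms"


def record_rest_request(endpoint: str, method: str, duration_ms: float) -> str:
    """Build and send a timing metric for one REST API request."""
    metric = format_timing_metric(
        f"mediawiki.rest_api.{method.lower()}.{endpoint}", duration_ms
    )
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP is fire-and-forget: a missing statsd daemon does not break requests.
        sock.sendto(metric.encode("utf-8"), (STATSD_HOST, STATSD_PORT))
    finally:
        sock.close()
    return metric
```

Because the send is UDP and fire-and-forget, enabling such tracking on a wiki without a metrics backend would be harmless, which matters for the scope constraint below about other Wikibase installations.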

Scope constraints:

  • Any adjusting of existing dashboards etc. is not a task for the Wikibase Product Platform team and would be tracked separately
  • Enabling the Wikidata team to collect usage metrics in their preferred system must be implemented in a way that does not require other Wikibase installations to have data/metric collection set up the same way as WMF/Wikidata
  • The WMDE Wikidata team likely has requirements for logging REST API requests the way it seems to be done for the Action API via the Mediawiki framework. There seems to be no similar mechanism for the Mediawiki REST API at this point. This will be tracked as a separate effort.
  • @Addshore hinted at WMDE's Wikidata metrics also collecting data based on the request path, which indicates API usage. Whether that data is already stored via the Mediawiki REST API framework is to be confirmed in a follow-up task. Adjusting the linked script is outside the responsibilities of the Wikibase Product Platform team.

Event Timeline

Notes from Backlog Refinement:

  1. Figure out whether the relevant data (number of requests and request execution time) can already be found in Hadoop logs, and whether it is sufficient for Wikidata's tracking needs
  2. If not, define which service interface Wikidata wants to use: Statsd/Graphite? Prometheus?

For the general understanding of REST API usage the following data could be worth tracking:

  • response codes by endpoint and method
  • number of PUT (maybe PATCH?) requests that do not have any effect on the data (see Idempotence)

It was confirmed by querying Wikimedia Hadoop that request data is actually being recorded, as required for any intended visualization or analysis.

Update: I have now created a dashboard prototype in Superset based on 1/128 sampled data for the launch. However, using an approach based on Grafana and Graphite seems more feasible long term.

For privacy reasons, the relevant data in Hive/Hadoop is limited to the last 30 days. Also, unsampled data is available only in a hard-to-use form. That means we would need to create a pipeline aggregating the data first and saving it in a new table to allow for long-term monitoring. Another downside of this approach is that the available visualization tools cannot create public dashboards yet (we would need to set up our own tools if we wanted to enable this).
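The aggregation step such a pipeline would perform can be sketched as collapsing raw (sampled) request rows into daily counts per endpoint and status code, so that the compact aggregate can be kept long term while the raw rows expire after 30 days. Field names here are hypothetical; the real job would run against Hive/Hadoop rather than in-memory Python:

```python
# Sketch of the aggregation a long-term pipeline would perform on raw request
# rows. The row schema (day/endpoint/status) is an assumption for illustration.
from collections import Counter


def aggregate_requests(rows):
    """Collapse raw request rows into one count per (day, endpoint, status)."""
    counts = Counter()
    for row in rows:
        counts[(row["day"], row["endpoint"], row["status"])] += 1
    # One compact output row per combination, suitable for a long-lived table.
    return [
        {"day": d, "endpoint": e, "status": s, "requests": n}
        for (d, e, s), n in sorted(counts.items())
    ]
```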

This is why it seems more feasible to create a dashboard using Grafana. The relevant API data appears to be already sent to Graphite thanks to a patch by Daniel Kinzler (the data can be found under Graphite Browser, "Metrics" > "MediaWiki" > "rest_api_*").
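Reading those metrics back out for a Grafana-style dashboard would go through Graphite's render API. A minimal sketch, where the Graphite host and the exact metric target are placeholders (the target pattern only follows the "Metrics" > "MediaWiki" > "rest_api_*" path mentioned above):

```python
# Sketch of querying Graphite's render API for the rest_api_* metrics.
# GRAPHITE_BASE and the target pattern are assumptions for illustration.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GRAPHITE_BASE = "https://graphite.example.org"  # assumption: actual host differs


def build_render_url(target: str, frm: str = "-7d", fmt: str = "json") -> str:
    """Compose a Graphite render API URL for the given metric target."""
    query = urlencode({"target": target, "from": frm, "format": fmt})
    return f"{GRAPHITE_BASE}/render?{query}"


def fetch_series(target: str):
    """Fetch and decode datapoints for a metric (performs a network call)."""
    with urlopen(build_render_url(target)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(build_render_url("MediaWiki.rest_api_latency.*"))
```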

See also: T327154: Create a basic monitoring of usage of Wikibase REST API on wikidata

@Manuel as a note, it really depends on what one wants to find out.
The data which Mediawiki sends to Graphite only records requests that made it to Mediawiki - it will not show that certain requests are made many more times but served from a cache, etc.
To know ALL requests, Hadoop is the way to go IMO. Preprocessed (no need for private data for most monitoring, I'd guess) and stored in aggregated form somewhere, as you say - but if one wanted to see all usage on Wikidata, that would be the source of truth.

Wikibase Product Platform's job is done here though, so I will be closing this task.

To know ALL requests, Hadoop is the way to go IMO.

Yes, that's true, we will have to keep that in mind! We will be able to measure the gap by comparing the results based on the two sources to see how urgent this is.

Also, the Grafana dashboard seems to not separate between and . We might want to address this before we run our next tests.