
Create a basic monitoring of usage of Wikibase REST API on wikidata
Closed, Resolved | Public | 5 Estimated Story Points

Description

As an Engineering Manager of the team building the Wikibase REST API, I would like to know how it works and is being used in production environments, so that the team can notice unexpected behaviour and have a basic understanding of the API's usage in the particular production environment.

For the immediate need we'll utilize the data MediaWiki itself sends to WMF Graphite and plot it in a Grafana dashboard, understanding the limitations of that data. Create the Grafana dashboard in a new "WMDE Wikibase Product Platform" folder.

Data to monitor

  • sum of all requests to Wikibase REST API in total and per endpoint
  • number of errors (4xx, 5xx) in total and per endpoint
  • number of success responses (one bucket containing everything in the 200-399 range) in total and per endpoint
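The bucketing described in the list above could be sketched roughly as follows. This is only an illustration of the counting logic, not how MediaWiki or Graphite actually record the data; the endpoint names and sample requests are made up, and the "other" bucket is an assumption for status codes outside 200-599.

```python
from collections import Counter


def status_bucket(status: int) -> str:
    """Map an HTTP status code to one of the buckets named in this task."""
    if 200 <= status <= 399:
        return "success"  # one bucket for everything in the 200-399 range
    if 400 <= status <= 499:
        return "4xx"
    if 500 <= status <= 599:
        return "5xx"
    return "other"  # assumption: anything else is tracked separately


def summarize(requests):
    """Count requests in total and per endpoint, split by status bucket."""
    total = Counter()
    per_endpoint = {}
    for endpoint, status in requests:
        bucket = status_bucket(status)
        counts = per_endpoint.setdefault(endpoint, Counter())
        for counter in (total, counts):
            counter["requests"] += 1
            counter[bucket] += 1
    return total, per_endpoint


# Hypothetical sample data: (endpoint, HTTP status) pairs.
sample = [
    ("endpoint_a", 200),
    ("endpoint_a", 404),
    ("endpoint_b", 304),
    ("endpoint_b", 500),
]
total, per_endpoint = summarize(sample)
```

With this sample, `total` holds the overall request, success, 4xx and 5xx counts, and `per_endpoint` holds the same breakdown for each endpoint.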

Unless it is technically difficult, collecting the numbers at one-second granularity would be ideal, but we do not intend to do more than MediaWiki and Graphite allow.

Links:

Event Timeline

Jakob_WMDE set the point value for this task to 5.

@Manuel This seems more like a subtask of T314133: Allow recording Wikibase REST API usage for Wikidata monitoring and metrics than the parent task. This task will shortly be completed and I'm assuming discussion on the monitoring and metrics topic will continue in T314133? What do you think?

A Wikibase REST API on Wikidata dashboard has been created.

Lessons learned:

  • For these types of graphs, it was important to apply the consolidateBy function with 'sum' after grouping or summing series together, to prevent the average of the grouped/summed series from being displayed.
  • You can reference one graphite query from within another using #[A-Z] (i.e. #A, #B, ..., #Z). This only works for queries labelled with a single letter though (e.g. you can't reference a query labelled #Errors).
  • You can hide graphite queries from displaying on the graph by hitting the eye icon. This is useful when combined with the previous bullet point.
  • Stacked graphs are better for seeing totals of series, non-stacked graphs are better for comparing series.
  • Remember to save the dashboard for panel changes to persist. Don't leave it too late to save the dashboard, as your session might time out and you will lose your changes when it errors out trying to save.
  • You can find the version of graphite being used by going to https://graphite.wikimedia.org/version. This is useful to be able to view the correct version of the docs.
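To illustrate the first lesson about consolidateBy: when a graph has more datapoints than it can display, neighbouring points get merged into one, and for request counts it matters whether they are averaged or summed. The sketch below is a simplified model of that effect, not Graphite's actual implementation; the per-minute counts and the bucket size are made-up values.

```python
def consolidate(values, points_per_bucket, func):
    """Merge every `points_per_bucket` neighbouring datapoints using `func`."""
    return [
        func(values[i:i + points_per_bucket])
        for i in range(0, len(values), points_per_bucket)
    ]


def average(chunk):
    return sum(chunk) / len(chunk)


# Hypothetical request counts, one datapoint per minute.
per_minute_requests = [2, 4, 6, 1, 1, 4]

# Averaging understates the number of requests in each displayed point.
averaged = consolidate(per_minute_requests, 3, average)

# Summing keeps the total number of requests intact.
summed = consolidate(per_minute_requests, 3, sum)

print(averaged, summed)
```

In this model the summed points still add up to the original 18 requests, while the averaged points add up to only 6, which is why the totals on the graph looked too low without consolidateBy set to 'sum'.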

Thank you for creating this and adding your lessons learned!

Do you happen to have a recommended way to drill down into the errors? Say, understanding what 4xx and 5xx codes were returned. Or is that last panel only 5xx and we should see all the data in logstash?

The dashboard has multiple limitations but is good enough for now. Thank you!

@Michael From what I have understood, the way MediaWiki reports those makes it impossible to get detail on the kind of error. We do have logstash to dig into 500 errors: https://logstash.wikimedia.org/app/dashboards#/view/c84cd2f0-9b6a-11ed-9fa2-93d8c6813b6b?_g=h@0d13436&_a=h@f1320bf
That does not quite help with understanding which kinds of 4xx errors happen and how often, but I assume we'll get there eventually.

We can't view the different 2xx and 3xx responses in grafana as that information isn't recorded with statsd, but we can view the different 4xx and 5xx responses. See Silvan's playpit dashboard for some examples.