Metrics about the use of the Wikimedia web APIs
Open, Normal, Public

Description

The mission of the Wikimedia Foundation is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally.

In order to know how well we are doing with our mission, we pay close attention to metrics about visits and contributions to Wikimedia sites. However, our web APIs also contribute to this mission by enabling community and third-party developers to spread and improve our free knowledge via software. So far, we don't seem to have metrics about the use of our web APIs, and we don't seem to have any ongoing or planned initiative to address this problem.

Let's define what API metrics we need, and what it would take to produce them.

Reasoning

The reasoning for this request is simple: we measure how many readers and editors we have based on the activity performed directly on Wikimedia websites. However, our web APIs allow access to, creation of, and modification of Wikimedia content through other channels, and we are not measuring the volume, percentages, and trends of these activities.

The goal is to measure the activities performed through our web APIs in order to obtain:

  • better numbers about readership and contributors, counting web + API
  • any numbers about the use of our web APIs, so we can plan better to improve them and increase their impact on readership and contributions.

It would be useful to identify request origins, at least to distinguish activity originating in Labs (community tools) from activity originating on third-party servers. All the better if we can identify the main third-party services using our web APIs.

Metrics requested

Specific metrics requested, and the stakeholders requesting them (a sketch of how such counts might be derived follows the list):

  • Number of user agents coming from Labs or third-party services, on a monthly basis + all time (DevRel, to check whether adoption of our APIs is increasing)
  • Volume of API requests coming from Labs or third-party services, on a monthly basis (DevRel, to check the trend of usage of our APIs)
  • Ranking of user agents coming from Labs or third-party services with the highest activity, on a monthly basis + all time (DevRel, to help identify the services making intensive use of our APIs)
  • Ranking of most requested actions/parameters, on a monthly basis + all time (DevRel, to help identify usage of our APIs and check it against our documentation, APIs we should promote...)
  • Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)
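
Purely as an illustration, counts like these could be derived from parsed API request logs along the following lines. This is a hypothetical sketch: the field names ('ua', 'origin', 'month'), the origin labels, and the sample rows are all invented, and the real pipeline would run over data in Hadoop rather than an in-memory array.

```
<?php
// Hypothetical sketch: per-month request volume and distinct user-agent
// counts per origin, plus a ranking by activity. All field names and
// sample rows are invented for illustration.
$requests = [
	[ 'ua' => 'MyLabsTool/1.2', 'origin' => 'labs', 'month' => '2015-09' ],
	[ 'ua' => 'AcmeReader/0.9', 'origin' => 'external', 'month' => '2015-09' ],
	[ 'ua' => 'MyLabsTool/1.2', 'origin' => 'labs', 'month' => '2015-09' ],
];

// Volume of requests per (month, origin, user agent).
$uaCounts = [];
foreach ( $requests as $r ) {
	$key = "{$r['month']}|{$r['origin']}|{$r['ua']}";
	$uaCounts[$key] = ( $uaCounts[$key] ?? 0 ) + 1;
}
arsort( $uaCounts ); // ranking: highest-activity user agents first

// Number of distinct user agents per (month, origin).
$distinct = [];
foreach ( array_keys( $uaCounts ) as $key ) {
	list( $month, $origin ) = explode( '|', $key );
	$distinct["$month|$origin"] = ( $distinct["$month|$origin"] ?? 0 ) + 1;
}
print_r( $uaCounts );
print_r( $distinct );
```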

Related Objects

Status      Assigned
Resolved    Qgil
Resolved    Keegan
Declined    None
Resolved    Qgil
Resolved    Qgil
Invalid     None
Invalid     None
Resolved    None
Declined    Qgil
Resolved    Qgil
Open        None
Open        None
Resolved    Tgr
Open        Tgr
Resolved    ArielGlenn
Resolved    bd808
Resolved    None
Resolved    Dzahn
Resolved    bd808
Resolved    Joe
Resolved    Joe
Resolved    Joe
Resolved    JAllemandou
Resolved    Ottomata
Resolved    bd808
Resolved    bd808
Open        None
Open        None
Resolved    bd808
Stalled     Tgr
Open        None

@Qgil, there is no disagreement here. Regarding "I'd rather focus on obtaining useful metrics of our web APIs": could you list what these useful metrics are?

Qgil updated the task description. (Show Details)Sep 9 2015, 4:50 PM
Qgil added a comment.Sep 9 2015, 4:53 PM

I have added a "Metrics requested" section to the description, and I have added the metric that the upcoming Developer Relations team is proposing to have as a KPI. @SVentura, I guess that metric is useful for Strategic Partnerships as well.

I think Reading should be interested in metrics related to distribution of Wikimedia content through 3rd party applications, and the same might be true for Editing and contributions to Wikimedia content done via third party applications.

@Qgil, you're right, different groups will have different KPI lenses - we're meeting this afternoon to discuss ours.

Tgr added a comment.Sep 9 2015, 10:17 PM

Number of active users of Wikimedia web APIs hosted in Wikimedia Labs and third party servers (requested by Engineering Community).

That metric seems problematic. For browser-based requests, we are not tracking unique visitors, so we could measure access counts but not the number of users. For other requests, tracking is probably not even possible in theory. We could count the number of distinct user agents and try to filter out noise, but that seems hard to do reliably.

Tgr added a comment.EditedSep 9 2015, 10:29 PM

We discussed this recently in Reading Infrastructure; there are three approaches to get the data:

  1. add user agents to api.log, get the output into Hadoop somehow (rsync?) - T108618
  2. use the XAnalytics extension to put API identification data into a header, which then gets logged via varnishkafka
  3. build on T106256 to log directly from the API to Kafka

Option 1 is quick and dirty; option 2 captures some extra data (like latency) and works for cached requests as well; option 3 avoids serializing the data into a string and back, and is probably more robust if we decide to log large amounts of data per request.
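
Purely for illustration, here is a minimal sketch of the kind of per-request event option 3 might produce. The field names are assumptions, and the real implementation would go through MediaWiki's Monolog-based logging and a Kafka producer rather than echoing JSON.

```
<?php
// Sketch (not the actual implementation): build one structured event
// per API request and serialize it as JSON, as option 3 would before
// producing it to a Kafka topic. All field names are hypothetical.
$event = [
	'ts'        => gmdate( 'c' ),
	'action'    => $_GET['action'] ?? 'query',
	'params'    => array_keys( $_GET ),  // parameter names only, no values
	'userAgent' => $_SERVER['HTTP_USER_AGENT'] ?? '',
	'ip'        => $_SERVER['REMOTE_ADDR'] ?? '',
];
// Stand-in for producing to Kafka: emit the JSON record to stdout.
echo json_encode( $event ), "\n";
```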

The data visualisation part of the task is T108414 (although that's currently focused on counting requests, not users).

For the REST API, we know that Google is fetching the HTML and data-parsoid for all edited pages as they happen. Another big external client is the Kiwix offline reader updating ZIM HTML dumps of entire wikis, which they then provide to users. We are in contact with both.

Qgil added a comment.Sep 16 2015, 12:50 PM

Just checking, by "active users" we mean products using our Web APIs, not the individual users (people) using these products. "Google is fetching the HTML and data-parsoid for all edited pages" is one user. "Kiwix offline reader updating ZIM HTML dumps of entire wikis" is another user. The Wikipedia app for Windows is another user.

It would be great to have an idea of the traffic generated by each user and the specific APIs they use. We would like to count whether there are more or fewer products using our APIs on a monthly basis, whether their activity increases or decreases, and whether their use of our APIs corresponds with our expectations, documentation, and developer plans.

It is understood that we cannot get 100% representative data. One API call by Google or Kiwix might result in an impact on thousands of users of their cached / stored data, while a modest tool in Labs might generate a lot of traffic with every click of its users. This is fine; our assumption is that more products should eventually lead to more users and more traffic with a higher diversity of activity.

We discussed this recently in Reading Infrastructure; there are three approaches to get the data:

  1. add user agents to api.log, get the output into Hadoop somehow (rsync?) - T108618
  2. use the XAnalytics extension to put API identification data into a header, which then gets logged via varnishkafka
  3. build on T106256 to log directly from the API to Kafka

Option 1 is quick and dirty; option 2 captures some extra data (like latency) and works for cached requests as well; option 3 avoids serializing the data into a string and back, and is probably more robust if we decide to log large amounts of data per request.

For CirrusSearch we had a similar question of how to log, store, and process thousands of events per second. We chose option 3, and I've implemented T106256, which will deploy with next week's train. This makes it relatively easy to set up a pipeline of data from the MediaWiki application servers into Kafka and eventually Hadoop. I put together two patches [1][2] which are a proof of concept of sending structured API logging to Kafka with the new logging code.

[1] https://gerrit.wikimedia.org/r/#/c/240614/
[2] https://gerrit.wikimedia.org/r/#/c/240617/

kevinator moved this task from Incoming to Backlog on the Analytics-Backlog board.
kevinator moved this task from Backlog to Radar on the Analytics-Backlog board.
Elitre added a subscriber: Elitre.Sep 29 2015, 4:54 PM
Qgil moved this task from Backlog to Doing on the DevRel-September-2015 board.Sep 29 2015, 6:19 PM
Tgr added a comment.Oct 1 2015, 1:33 AM

I think this is now pretty close to actually happening. Could someone from Developer-Advocacy update the "Metrics requested" section to contain actually measurable things? We cannot measure the number of users (also, what's "active"?), but we can measure unique IPs or user agents or whatever.

For now, the most precise description seems to be the one given by @Halfak, which mentions API module(s), user agent, OAuth consumer ID, user ID, central user ID, and request type as the things we log. Do we actually have a use case for user ID / central user ID? That seems somewhat privacy-sensitive, so it should only be collected if we really need it. Also, will the OAuth consumer ID (or preferably name + version, if we can easily fetch that) give any extra information not given by the user agent? Presumably OAuth tools would set a sane UA.

Qgil updated the task description. (Show Details)Oct 1 2015, 8:46 AM
Qgil added a comment.Oct 1 2015, 8:51 AM

I tried; check the description. I have added the reasoning for each item, in order to help others help us ask for sensible data.

It would be useful to have a separation between data from Labs and data from third parties. The Labs and Strategic Partnerships teams probably agree?

It would definitely be useful to combine data from the Action API, RESTBase, and the Wikidata Query Service, if that makes technical sense at all. What I mean is that we don't really care which technology is being used; we care about who is using our APIs and what for.

PS: it looks like Reading-Infrastructure-Team-Backlog is taking this task officially?

bd808 added a project: Epic.Oct 1 2015, 3:17 PM

I tried; check the description. I have added the reasoning for each item, in order to help others help us ask for sensible data.

It would be useful to have a separation between data from Labs and data from third parties. The Labs and Strategic Partnerships teams probably agree?

I think that the current metrics (UA by IP classification, requests by IP classification, actions) are things we can actually figure out how to count.

Using IP blocks we should be able to classify requests as "Internal" (coming from WMF production hosts like the Parsoid servers), "Labs" (coming from the grid engine servers or other Labs hosts), or "External" (anywhere else).
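
For illustration, here is a minimal sketch of that classification. The CIDR ranges below are placeholders, not the real WMF production or Labs networks.

```
<?php
// Sketch of classifying a request IP as Internal / Labs / External by
// CIDR block. The ranges are placeholders for illustration only.
function ipInCidr( $ip, $cidr ) {
	list( $subnet, $bits ) = explode( '/', $cidr );
	$mask = -1 << ( 32 - (int)$bits );
	return ( ip2long( $ip ) & $mask ) === ( ip2long( $subnet ) & $mask );
}

function classifyIp( $ip, array $classes ) {
	foreach ( $classes as $label => $cidrs ) {
		foreach ( $cidrs as $cidr ) {
			if ( ipInCidr( $ip, $cidr ) ) {
				return $label;
			}
		}
	}
	return 'External'; // anywhere else
}

$classes = [
	'Internal' => [ '10.0.0.0/8' ],    // placeholder for WMF production hosts
	'Labs'     => [ '172.16.0.0/21' ], // placeholder for Labs instances
];
echo classifyIp( '172.16.1.5', $classes ), "\n"; // Labs
```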

It would definitely be useful to combine data from the Action API, RESTBase, and the Wikidata Query Service, if that makes technical sense at all. What I mean is that we don't really care which technology is being used; we care about who is using our APIs and what for.

PS: it looks like Reading-Infrastructure-Team-Backlog is taking this task officially?

Reading-Infrastructure-Team-Backlog will commit to helping get useful data from the Action API into Hadoop by publishing detailed per-request data to a Kafka topic. Actually building reports/dashboards from there will not be something we commit to. We would be glad to work with whoever builds those reports to tweak the data we send from MediaWiki into Hadoop.

bd808 claimed this task.Oct 1 2015, 3:19 PM

@Tnegrin has suggested that I act as product manager for this epic to help coordinate work between various teams and try to keep it from stalling out. I'll spend some time over the next week or so trying to make some useful subtasks and talking to various teams to see if they can provide assistance over the October-December time frame (WMF 2015/2016-Q2).

Qgil awarded a token.Oct 1 2015, 3:23 PM
Ottomata added a subscriber: Ottomata.EditedOct 2 2015, 6:00 PM

@bd808, I believe @EBernhardson is producing binary Avro to Kafka. At this time, I advise you to stick to JSON. Analytics is working on figuring out how to import the binary data into Hadoop properly, but it is not as easy as we thought it would be. EventBus will in the near term only support producing JSON.

Using an Avro schema is fine, as is using a JSON schema. If you do go with an Avro schema (which you are!), you should be able to push the Avro-JSON representation of your data instead of binary. I'm not sure how that works with the existing Avro+Monolog implementation, but we should make sure that it works.

The Avro encoder in PHP doesn't look like it directly supports JSON encoding, but I don't think it needs to either. We would just need to re-use the schema validation, then do a standard JSON encoding. It seems we might not need binary Avro on the PHP side at all then?
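
As a rough sketch of that idea, with a hand-rolled required-field check standing in for real Avro schema validation (the field names are hypothetical):

```
<?php
// Sketch: validate an event against the schema's field list, then emit
// standard JSON instead of binary Avro. The check below is a stand-in
// for real Avro schema validation; field names are hypothetical.
$schemaFields = [ 'ts', 'action', 'userAgent' ]; // taken from the Avro schema
$event = [ 'ts' => gmdate( 'c' ), 'action' => 'query', 'userAgent' => 'Foo/1.0' ];

$missing = array_diff( $schemaFields, array_keys( $event ) );
if ( $missing ) {
	throw new RuntimeException( 'Event missing fields: ' . implode( ', ', $missing ) );
}
echo json_encode( $event ), "\n";
```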

Spage updated the task description. (Show Details)Oct 6 2015, 8:24 PM
bd808 added a comment.Oct 8 2015, 7:37 PM

Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)

@Spage Do you envision needing this particular subset of the data over a long period of time or would something like the last 30 days or so suffice?

I haven't created a sub-task for it yet, but one of the things we are going to need to nail down for all of this is what the aggregate data tables that we roll information up into for historic reporting should look like. We won't be able to keep all of the data around forever, both because of the storage space needed and because of our data retention policies. I'm thinking that this particular use case is more near-term operational than long-term trend related.

Nuria added a subscriber: Nuria.Oct 8 2015, 9:16 PM

Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)

If you are thinking of HTTP errors, the data is present right now in the webrequest table (for the last couple of months), so you can get this information as of now. For application errors, you will need to wait until those are published.

dr0ptp4kt moved this task from Backlog to Current Quarter on the Reading-Admin board.
Addshore added a subscriber: Addshore.

If you are thinking of HTTP errors, the data is present right now in the webrequest table

The HTTP response to a failing API request like https://www.mediawiki.org/w/api.php?action=parse&page=Wrong is 200 OK. As T113672 says, the API response has a fantastic HTTP header:

MediaWiki-API-Error: missingtitle
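
That is easy to confirm from any client; a quick sketch, assuming the header is still emitted as described:

```
<?php
// Check the behavior described above: the HTTP status is 200 OK, but
// the MediaWiki-API-Error header carries the actual error code.
$url = 'https://www.mediawiki.org/w/api.php?action=parse&page=Wrong';
$headers = get_headers( $url, 1 );
echo $headers[0], "\n";                                  // HTTP/1.1 200 OK
echo $headers['MediaWiki-API-Error'] ?? '(none)', "\n";  // missingtitle
```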

Could we add this MediaWiki-API-Error header to the webrequest table? Should I file a separate bug?

Nuria added a comment.Oct 26 2015, 2:28 PM

Could we add this MediaWiki-API-Error header to the webrequest table? Should I file a separate bug?

We can add this value to the X-Analytics header, but it will be worth thinking about whether we want a global value to signal "health of API request". For example, does this header capture a 503?
I do not know much about API return codes, so to capture the precise information I think it will be best if you open a new ticket and assign it to the API developers (while tagging it with Analytics-Backlog so we also see it), and we can work together on it.

Could we add this MediaWiki-API-Error header to the webrequest table? Should I file a separate bug?

We can add this value to the X-Analytics header, but it will be worth thinking about whether we want a global value to signal "health of API request". For example, does this header capture a 503?

A 503 from the API means a PHP fatal or something so badly screwed up that exceptions aren't being caught. I have no idea what "it will be worth thinking about whether we want a global value to signal 'health of API request'" might mean.

I do not know much about API return codes, so to capture the precise information I think it will be best if you open a new ticket and assign it to the API developers (while tagging it with Analytics-Backlog so we also see it), and we can work together on it.

I see nothing relevant to the API developers (i.e. me) here that would require a new ticket.

Nuria added a comment.Oct 26 2015, 6:49 PM

I see nothing relevant to the API developers (i.e. me) here that would require a new ticket.

I think a ticket that is concise regarding the header will be clearer than this wide-scope ticket.

Now, if you want this done in the near term, you can ping us on #wikimedia-analytics and we can help you make the needed changes on Varnish so the header is part of the X-Analytics map. Otherwise the work can go into our backlog, but we will not get to it for a bit.

I see nothing relevant to the API developers (i.e. me) here that would require a new ticket.

I think a ticket that is concise regarding the header will be clearer than this wide-scope ticket.

That doesn't say why you suggested it be assigned to the API developers, or why you implied that the API developers would necessarily have anything to do to make it happen, with Analytics only needing to "see" it.

Nuria added a comment.Oct 26 2015, 8:55 PM

@Spage: added a (sub) task : https://phabricator.wikimedia.org/T116658 to keep things organized.

bd808 moved this task from To Do to In Dev/Progress on the User-bd808 board.Nov 4 2015, 5:24 AM
Ainali removed a subscriber: Ainali.Nov 13 2015, 9:50 PM
Milimetric moved this task from Incoming to Radar on the Analytics board.Jan 12 2016, 7:32 PM
Krinkle removed a subscriber: Krinkle.Mar 22 2016, 7:54 PM
Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptApr 20 2016, 12:52 AM
bd808 moved this task from In Dev/Progress to To Do on the User-bd808 board.Aug 1 2016, 11:56 PM
bd808 removed bd808 as the assignee of this task.Sep 30 2016, 10:14 PM

Unlicking this soggy old cookie.

Tgr moved this task from Backlog to Pending on the User-Tgr board.May 3 2017, 9:21 AM
dr0ptp4kt moved this task from Admin to Tracking on the Reading-Admin board.Jul 20 2017, 10:00 PM
Qgil removed a subscriber: Qgil.Aug 28 2018, 9:12 AM