
Metrics about the use of the Wikimedia web APIs
Closed, ResolvedPublic

Description

The mission of the Wikimedia Foundation is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally.

In order to know how well we are doing with our mission, we invest a lot of attention in metrics about visits and contributions to Wikimedia sites. However, our web APIs also contribute to this mission by enabling community and third party developers to spread and improve our free knowledge via software. So far, we don't seem to have metrics about the use of our web APIs, and we don't seem to have any ongoing or planned initiative to address this problem.

Let's define what API metrics we need, and what it would take to produce them.

Reasoning

The reasoning for this request is simple: we measure how many readers and editors we have based on the activity performed directly on Wikimedia websites. However, our web APIs allow Wikimedia content to be accessed, created, and modified through other channels, and we are not measuring the volume, proportions, and trends of these activities.

The goal is to measure the activities performed through our web APIs in order to obtain

  • better numbers about readership and contributions, counting web + API activity
  • any numbers at all about the use of our web APIs, so we can plan improvements and increase their impact on readership and contributions.

It would be useful to identify request origins, at least to distinguish activity originating in Labs (community tools) from activity originating on third-party servers. All the better if we can identify the main third-party services using our web APIs.

Metrics requested

Specific metrics requested, and the stakeholders requesting them (an illustrative sketch of how such counts could be derived follows the list):

  • Number of user agents coming from Labs or third-party services, on a monthly basis + all time (DevRel, to check whether adoption of our APIs is increasing)
  • Volume of API requests coming from Labs or third party services, on a monthly basis (DevRel, to check the trend of usage of our APIs)
  • Ranking of the user agents from Labs or third-party services with the highest activity, on a monthly basis + all time (DevRel, to help identify the services making intensive use of our APIs)
  • Ranking of the most requested actions/parameters, on a monthly basis + all time (DevRel, to help identify how our APIs are used and to check this against our documentation, APIs we should promote, etc.)
  • Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)
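
The sketch below is illustrative only: it assumes per-request log records with hypothetical field names (month, origin_class, user_agent, action) and shows how the monthly counts above could be derived from them; it is not an agreed implementation.

    from collections import Counter, defaultdict

    def monthly_api_metrics(records):
        """Aggregate request volume, unique user agents, and top actions per month/origin."""
        volume = Counter()              # requests per (month, origin_class)
        agents = defaultdict(set)       # unique user agents per (month, origin_class)
        actions = Counter()             # requests per (month, action)
        for r in records:
            key = (r["month"], r["origin_class"])    # e.g. ("2015-10", "labs")
            volume[key] += 1
            agents[key].add(r["user_agent"])
            actions[(r["month"], r["action"])] += 1
        return {
            "request_volume": dict(volume),
            "unique_user_agents": {k: len(v) for k, v in agents.items()},
            "top_actions": actions.most_common(20),
        }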

Related Objects

31 subtasks: 22 resolved, 4 declined, 3 open, 2 invalid (assignees include Qgil, Keegan, bd808, Tgr, ArielGlenn, Dzahn, Joe, JAllemandou, and Ottomata).

Event Timeline


I think this is now pretty close to actually happening. Could someone from #DevRel update the "Metrics requested" section so that it contains actually measurable things? We cannot measure the number of users (also, what counts as "active"?), but we can measure unique IPs, user agents, or similar.

For now, the most precise description seems to be the one given by @Halfak, which mentions API module(s), user agent, OAuth consumer ID, user ID, central user ID, and request type as the things we would log. Do we actually have a use case for user ID / central user ID? They seem somewhat privacy-sensitive, so they should only be collected if we really need them. Also, will the OAuth consumer ID (or preferably name + version, if we can easily fetch that) give any extra information not already given by the user agent? Presumably OAuth tools would set a sane UA.
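
For concreteness, a minimal sketch of what such a per-request event could look like if logged as JSON; the field names follow the list above but are assumptions rather than an agreed schema, and user_id / central_user_id are included only to show where they would sit if they were collected at all.

    import json
    import time

    def build_api_request_event(module, user_agent, oauth_consumer_id=None,
                                user_id=None, central_user_id=None,
                                request_type="GET"):
        # Hypothetical field names; user_id / central_user_id are privacy-sensitive
        # and would likely be dropped unless a concrete use case appears.
        return {
            "timestamp": int(time.time()),
            "module": module,                    # e.g. "query", "parse", "edit"
            "user_agent": user_agent,
            "oauth_consumer_id": oauth_consumer_id,
            "user_id": user_id,
            "central_user_id": central_user_id,
            "request_type": request_type,
        }

    print(json.dumps(build_api_request_event("query", "ExampleBot/1.0")))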

I tried; check the description. I have added the reasoning to each item, in order to help others help us ask for sensible data.

It would be useful to have a separation between data from Labs and data from third parties. The Labs and Strategic Partnerships teams probably agree?

It would definitely be useful to combine data from the Action API, RESTBase, and the Wikidata Query Service, if that makes technical sense at all. What I mean is that we don't really care which technology is being used; we care about who is using our APIs and what for.

PS: it looks like #Reading-Infrastructure-Team is taking this task officially?

> I tried; check the description. I have added the reasoning to each item, in order to help others help us ask for sensible data.
>
> It would be useful to have a separation between data from Labs and data from third parties. The Labs and Strategic Partnerships teams probably agree?

I think that the current metrics (UA by IP classification, requests by IP classification, actions) are things we can actually figure out how to count.

Using IP blocks we should be able to classify requests as "Internal" (coming from WMF production hosts like the Parsoid servers), "Labs" (coming from the grid engine servers or other Labs hosts), or "External" (anywhere else).
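
A rough sketch of that classification, assuming the relevant network ranges are known; the CIDR blocks below are placeholders, not the real WMF production or Labs ranges.

    import ipaddress

    # Placeholder CIDR blocks; the real WMF production and Labs ranges would be
    # substituted here.
    INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]
    LABS_NETS = [ipaddress.ip_network("172.16.0.0/21")]

    def classify_request_origin(ip_str):
        ip = ipaddress.ip_address(ip_str)
        if any(ip in net for net in INTERNAL_NETS):
            return "internal"    # WMF production hosts, e.g. Parsoid servers
        if any(ip in net for net in LABS_NETS):
            return "labs"        # grid engine and other Labs hosts
        return "external"        # anywhere else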

> It would definitely be useful to combine data from the Action API, RESTBase, and the Wikidata Query Service, if that makes technical sense at all. What I mean is that we don't really care which technology is being used; we care about who is using our APIs and what for.
>
> PS: it looks like #Reading-Infrastructure-Team is taking this task officially?

#Reading-Infrastructure-Team will commit to helping get useful data from the Action API into Hadoop by publishing detailed per-request data to a Kafka topic. Actually building reports/dashboards from there will not be something we commit to. We would be glad to work with whoever builds those reports to tweak the data we send from MediaWiki into Hadoop.

@Tnegrin has suggested that I act as product manager for this epic to help coordinate work between various teams and try to keep it from stalling out. I'll spend some time over the next week or so trying to make some useful subtasks and talking to various teams to see if they can provide assistance over the October-December time frame (WMF 2015/2016-Q2).

@bd808: @EBernhardson, I believe, is producing binary Avro to Kafka. At this time, I advise that you stick to JSON. Analytics is working on figuring out how to import the binary data into Hadoop properly, but it is not as easy as we thought it would be. EventBus will, in the near term, only support producing JSON.

Using an Avro schema is fine, as is using a JSON schema. If you do go with an Avro schema (which you are!), you should be able to push the Avro-JSON representation of your data instead of the binary one. I'm not sure how that works with the existing Avro+Monolog implementation, but we should make sure that it does.

The Avro encoder in PHP doesn't appear to directly support JSON encoding, but I don't think it needs to either. We would just need to reuse the schema validation, then do a standard JSON encoding. It seems we might not need binary Avro on the PHP side at all then?
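
A hedged sketch of that idea: validate the record against the Avro schema, then produce the plain JSON encoding to Kafka. It uses fastavro and kafka-python purely for illustration; the schema, topic name, and broker address are placeholders, and the real code path would live on the MediaWiki (PHP) side via the existing Avro+Monolog setup.

    import json
    from fastavro.validation import validate   # schema validation only, no binary encoding
    from kafka import KafkaProducer            # kafka-python

    # Placeholder schema; the real per-request schema would be reused here.
    SCHEMA = {
        "type": "record",
        "name": "ApiRequest",
        "fields": [
            {"name": "module", "type": "string"},
            {"name": "user_agent", "type": "string"},
        ],
    }

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",                        # placeholder broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # JSON, not binary Avro
    )

    def publish(record, topic="mediawiki.api-request"):            # hypothetical topic name
        validate(record, SCHEMA)       # reuse the schema validation step
        producer.send(topic, record)   # payload is produced as JSON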

> Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)

@Spage: Do you envision needing this particular subset of the data over a long period of time, or would something like the last 30 days or so suffice?

I haven't created a sub-task for it yet, but one of the things we are going to need to nail down for all of this is what the aggregate data tables that we roll information up into for historic reporting will look like. We won't be able to keep all of the data around forever, both because of the storage space needed and because of our data retention policies. I'm thinking that this particular use case is more near-term operational than long-term trend related.
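
As a sketch of the kind of near-term roll-up meant here, assuming per-event records with timestamp, action, and user_agent fields (all hypothetical names): keep roughly the last 30 days of error events and aggregate counts by action and user agent.

    from collections import Counter
    from datetime import datetime, timedelta

    def rollup_recent_errors(error_events, days=30):
        """Error counts by (action, user_agent) over roughly the last `days` days."""
        cutoff = datetime.utcnow() - timedelta(days=days)
        counts = Counter()
        for e in error_events:
            if e["timestamp"] >= cutoff:     # anything older has aged out of retention
                counts[(e["action"], e["user_agent"])] += 1
        return counts.most_common()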

> Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)

If you are thinking of HTTP errors, the data is present right now in the webrequest table (for the last couple of months), so you can get this information already. For application errors you will need to wait until those are published.

> If you are thinking of HTTP errors, the data is present right now in the webrequest table

The HTTP response to a failing API request like https://www.mediawiki.org/w/api.php?action=parse&page=Wrong is 200 OK. As T113672 says, the API response has a fantastic HTTP header:

MediaWiki-API-Error: missingtitle

Could we add this MediaWiki-API-Error header to the webrequest table? Should I file a separate bug?
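
For illustration, a small client-side check of that behaviour using the Python requests library, with the example request from above; the User-Agent string is a placeholder.

    import requests

    resp = requests.get(
        "https://www.mediawiki.org/w/api.php",
        params={"action": "parse", "page": "Wrong", "format": "json"},
        headers={"User-Agent": "ApiMetricsExample/0.1 (placeholder)"},
    )
    print(resp.status_code)                          # 200, even though the request failed
    print(resp.headers.get("MediaWiki-API-Error"))   # e.g. "missingtitle"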

> Could we add this MediaWiki-API-Error header to the webrequest table? Should I file a separate bug?

We can add this value to the X-Analytics header, but it is worth thinking about whether we want a global value to signal the "health of an API request". For example, does this header capture a 503?
I do not know much about API return codes, so to capture the precise information I think it will be best if you open a new ticket and assign it to the API developers (while tagging it with analytics-backlog so we also see it), and we can work together on it.
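
For reference, X-Analytics is a semicolon-separated key=value header, so carrying the API error there might look like the sketch below; the key name mw_api_error is hypothetical.

    def parse_x_analytics(header_value):
        """Parse an X-Analytics value such as 'ns=0;mw_api_error=missingtitle'."""
        pairs = {}
        for part in header_value.split(";"):
            if "=" in part:
                key, value = part.split("=", 1)
                pairs[key.strip()] = value.strip()
        return pairs

    print(parse_x_analytics("ns=0;mw_api_error=missingtitle"))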

> Could we add this MediaWiki-API-Error header to the webrequest table? Should I file a separate bug?
>
> We can add this value to the X-Analytics header, but it is worth thinking about whether we want a global value to signal the "health of an API request". For example, does this header capture a 503?

A 503 from the API means a PHP fatal or something so badly screwed up that exceptions aren't being caught. I have no idea what "whether we want a global value to signal the 'health of an API request'" might mean.

> I do not know much about API return codes, so to capture the precise information I think it will be best if you open a new ticket and assign it to the API developers (while tagging it with analytics-backlog so we also see it), and we can work together on it.

I see nothing relevant to the API developers (i.e. me) here that would require a new ticket.

> I see nothing relevant to the API developers (i.e. me) here that would require a new ticket.

I think a ticket that is concise and specific to the header will be clearer than this wide-scope ticket.

Now, if you want this done in the near term, you can ping us on #wikimedia-analytics and we can help you make the needed changes in Varnish so the header becomes part of the X-Analytics map. Otherwise the work can go into our backlog, but we will not get to it for a bit.

> I see nothing relevant to the API developers (i.e. me) here that would require a new ticket.
>
> I think a ticket that is concise and specific to the header will be clearer than this wide-scope ticket.

That doesn't explain why you suggested it be assigned to the API developers, or why you implied that the API developers would necessarily have anything to do to make it happen, with Analytics only needing to "see" it.

bd808 removed bd808 as the assignee of this task. Sep 30 2016, 10:14 PM

Unlicking this soggy old cookie.

bd808 claimed this task.

Closing as "resolved" rather than "declined". The initial goals of this task were ambitious and, it turns out, also somewhat in conflict with technical and legal realities of the Wikimedia MediaWiki deployments. We learned quite a bit, but we really cannot ever reach the initially stated goal without either requiring API tokens that can be counted or using User-Agent + IP information as a rough substitute. 4.5 years seems long enough to leave an unsolvable task open. ;)

The same result and rationale apply here.