Metrics about the use of the Wikimedia web APIs
Open, Normal, Public

Description

The mission of the Wikimedia Foundation is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally.

In order to know how well we are doing with our mission, we pay close attention to metrics about visits and contributions to Wikimedia sites. However, our web APIs also contribute to this mission by enabling community and third-party developers to spread and improve our free knowledge via software. So far, we don't seem to have metrics about the use of our web APIs, and we don't seem to have any ongoing or planned initiative to address this problem.

Let's define what API metrics we need, and what it would take to produce them.

Reasoning

The reasoning for this request is simple: we measure how many readers and editors we have based on the activity performed directly on Wikimedia websites. However, our web APIs allow access to, creation of, and modification of Wikimedia content through other channels, and we are not measuring the volume, percentages, and trends of these activities.

The goal is to measure the activities performed through our web APIs in order to obtain:

  • better numbers about readership and contributors, counting web + API
  • any numbers about the use of our web APIs, so we can plan better to improve them and increase their impact on readership and contributions.

It would be useful to identify request origins, at least to distinguish activity originating in Labs (community tools) from activity originating on third-party servers. All the better if we can identify the main third-party services using our web APIs.

Metrics requested

Specific metrics requested, and the stakeholders requesting them (a sketch of how such counts might be derived follows the list):

  • Number of user agents coming from Labs or third-party services, on a monthly basis + all time (DevRel, to check whether adoption of our APIs is increasing)
  • Volume of API requests coming from Labs or third-party services, on a monthly basis (DevRel, to check the trend of usage of our APIs)
  • Ranking of user agents coming from Labs or third-party services with the highest activity, on a monthly basis + all time (DevRel, to help identify the services making intensive use of our APIs)
  • Ranking of most requested actions/parameters, on a monthly basis + all time (DevRel, to help identify usage of our APIs and check it against our documentation, APIs we should promote...)
  • Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)
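
Purely as an illustration, counts like these could be derived from parsed API request logs along the following lines. This is a hypothetical sketch: the field names ('ua', 'origin', 'month'), the origin labels, and the sample rows are all invented, and the real pipeline would run over data in Hadoop rather than an in-memory array.

```
<?php
// Hypothetical sketch: per-month request volume and distinct user-agent
// counts per origin, plus a ranking by activity. All field names and
// sample rows are invented for illustration.
$requests = [
	[ 'ua' => 'MyLabsTool/1.2', 'origin' => 'labs', 'month' => '2015-09' ],
	[ 'ua' => 'AcmeReader/0.9', 'origin' => 'external', 'month' => '2015-09' ],
	[ 'ua' => 'MyLabsTool/1.2', 'origin' => 'labs', 'month' => '2015-09' ],
];

// Volume of requests per (month, origin, user agent).
$uaCounts = [];
foreach ( $requests as $r ) {
	$key = "{$r['month']}|{$r['origin']}|{$r['ua']}";
	$uaCounts[$key] = ( $uaCounts[$key] ?? 0 ) + 1;
}
arsort( $uaCounts ); // ranking: highest-activity user agents first

// Number of distinct user agents per (month, origin).
$distinct = [];
foreach ( array_keys( $uaCounts ) as $key ) {
	list( $month, $origin ) = explode( '|', $key );
	$distinct["$month|$origin"] = ( $distinct["$month|$origin"] ?? 0 ) + 1;
}
print_r( $uaCounts );
print_r( $distinct );
```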

Related Objects

Status      Assigned
Resolved    Qgil
Resolved    Keegan
Declined    None
Resolved    Qgil
Resolved    Qgil
Invalid     None
Invalid     None
Resolved    None
Declined    Qgil
Resolved    Qgil
Open        None
Open        None
Resolved    Tgr
Open        Tgr
Resolved    ArielGlenn
Resolved    bd808
Resolved    None
Resolved    Dzahn
Resolved    bd808
Resolved    Joe
Resolved    Joe
Resolved    Joe
Resolved    JAllemandou
Resolved    Ottomata
Resolved    bd808
Resolved    bd808
Open        None
Open        None
Resolved    bd808
Stalled     Tgr
Open        None

@Qgil, there is no disagreement here. Regarding "I'd rather focus on obtaining useful metrics of our web APIs": could you list what these useful metrics are?

Qgil updated the task description. (Show Details)Sep 9 2015, 4:50 PM
Qgil added a comment.Sep 9 2015, 4:53 PM

I have added a "Metrics requested" section to the description, and I have added the metric that the upcoming Developer Relations team is proposing to have as a KPI. @SVentura, I guess that metric is useful for Strategic Partnerships as well.

I think Reading should be interested in metrics related to distribution of Wikimedia content through 3rd party applications, and the same might be true for Editing and contributions to Wikimedia content done via third party applications.

@Qgil, you're right, different groups will have different KPI lenses - we're meeting this afternoon to discuss ours.

Tgr added a comment.Sep 9 2015, 10:17 PM

Number of active users of Wikimedia web APIs hosted in Wikimedia Labs and third party servers (requested by Engineering Community).

That metric seems problematic. For browser-based requests, we are not tracking unique visitors, so we could measure access counts but not the number of users. For other requests, tracking is probably not even possible in theory. We could count the number of distinct user agents and try to filter out noise, but that seems hard to do reliably.

Tgr added a comment.EditedSep 9 2015, 10:29 PM

We discussed this recently in Reading Infrastructure; there are three approaches to get the data:

  1. add user agents to api.log, get the output into Hadoop somehow (rsync?) - T108618
  2. use the XAnalytics extension to put API identification data into a header, which then gets logged via varnishkafka
  3. build on T106256 to log directly from the API to Kafka

Option 1 is quick and dirty; option 2 captures some extra data (like latency) and works for cached requests as well; option 3 avoids serializing the data into a string and back, and is probably more robust if we decide to log large amounts of data per request.
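
Purely for illustration, here is a minimal sketch of the kind of per-request event option 3 might produce. The field names are assumptions, and the real implementation would go through MediaWiki's Monolog-based logging and a Kafka producer rather than echoing JSON.

```
<?php
// Sketch (not the actual implementation): build one structured event
// per API request and serialize it as JSON, as option 3 would before
// producing it to a Kafka topic. All field names are hypothetical.
$event = [
	'ts'        => gmdate( 'c' ),
	'action'    => $_GET['action'] ?? 'query',
	'params'    => array_keys( $_GET ),  // parameter names only, no values
	'userAgent' => $_SERVER['HTTP_USER_AGENT'] ?? '',
	'ip'        => $_SERVER['REMOTE_ADDR'] ?? '',
];
// Stand-in for producing to Kafka: emit the JSON record to stdout.
echo json_encode( $event ), "\n";
```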

The data visualisation part of the task is T108414 (although that's currently focused on counting requests, not users).

For the REST API, we know that Google is fetching the HTML and data-parsoid for all edited pages as they happen. Another big external client is the Kiwix offline reader updating ZIM HTML dumps of entire wikis, which they then provide to users. We are in contact with both.

Qgil added a comment.Sep 16 2015, 12:50 PM

Just checking, by "active users" we mean products using our Web APIs, not the individual users (people) using these products. "Google is fetching the HTML and data-parsoid for all edited pages" is one user. "Kiwix offline reader updating ZIM HTML dumps of entire wikis" is another user. The Wikipedia app for Windows is another user.

It would be great to have an idea of the traffic generated by each user and the specific APIs they use. We would like to count whether there are more or fewer products using our APIs on a monthly basis, whether their activity increases or decreases, and whether their use of our APIs corresponds with our expectations, documentation, and developer plans.

It is understood that we cannot get 100% representative data. One API call by Google or Kiwix might result in an impact on thousands of users of their cached / stored data, while a modest tool in Labs might generate a lot of traffic with every click of its users. This is fine; our assumption is that more products should eventually lead to more users and more traffic with a higher diversity of activity.

We discussed this recently in Reading Infrastructure; there are three approaches to get the data:

  1. add user agents to api.log, get the output into Hadoop somehow (rsync?) - T108618
  2. use the XAnalytics extension to put API identification data into a header, which then gets logged via varnishkafka
  3. build on T106256 to log directly from the API to Kafka

Option 1 is quick and dirty; option 2 captures some extra data (like latency) and works for cached requests as well; option 3 avoids serializing the data into a string and back, and is probably more robust if we decide to log large amounts of data per request.

For CirrusSearch we had a similar question of how to log, store, and process thousands of events per second. We chose option 3, and I've implemented T106256, which will deploy with next week's train. This makes it relatively easy to set up a pipeline of data from the MediaWiki application servers into Kafka and eventually Hadoop. I put together two patches [1][2] which are a proof of concept of sending structured API logging to Kafka with the new logging code.

[1] https://gerrit.wikimedia.org/r/#/c/240614/
[2] https://gerrit.wikimedia.org/r/#/c/240617/

kevinator moved this task from Incoming to Backlog on the Analytics-Backlog board.
kevinator moved this task from Backlog to Radar on the Analytics-Backlog board.
Elitre added a subscriber: Elitre.Sep 29 2015, 4:54 PM
Qgil moved this task from Backlog to Doing on the DevRel-September-2015 board.Sep 29 2015, 6:19 PM
Tgr added a comment.Oct 1 2015, 1:33 AM

I think this is now pretty close to actually happening. Could someone from Developer-Advocacy update the "Metrics requested" section to contain actually measurable things? We cannot measure the number of users (also, what's "active"?), but we can measure unique IPs or user agents or whatever.

For now, the most precise description seems to be the one given by @Halfak, which mentions API module(s), user agent, OAuth consumer ID, user ID, central user ID, and request type as the things we log. Do we actually have a use case for user ID / central user ID? That seems somewhat privacy-sensitive, so it should only be collected if we really need it. Also, will the OAuth consumer ID (or preferably name + version, if we can easily fetch that) give any extra information not given by the user agent? Presumably OAuth tools would set a sane UA.

Qgil updated the task description. (Show Details)Oct 1 2015, 8:46 AM
Qgil added a comment.Oct 1 2015, 8:51 AM

I tried; check the description. I have added the reasoning for each item, in order to help others help us ask for sensible data.

It would be useful to have a separation between data from Labs and data from third parties. The Labs and Strategic Partnerships teams probably agree?

It would definitely be useful to combine data from the Action API, RESTBase, and the Wikidata Query Service, if that makes technical sense at all. What I mean is that we don't really care which technology is being used; we care about who is using our APIs and what for.

PS: it looks like Reading-Infrastructure-Team-Backlog is taking this task officially?

bd808 added a project: Epic.Oct 1 2015, 3:17 PM

I tried; check the description. I have added the reasoning for each item, in order to help others help us ask for sensible data.

It would be useful to have a separation between data from Labs and data from third parties. The Labs and Strategic Partnerships teams probably agree?

I think that the current metrics (UA by IP classification, requests by IP classification, actions) are things we can actually figure out how to count.

Using IP blocks we should be able to classify requests as "Internal" (coming from WMF production hosts like the Parsoid servers), "Labs" (coming from the grid engine servers or other Labs hosts), or "External" (anywhere else).
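
For illustration, here is a minimal sketch of that classification. The CIDR ranges below are placeholders, not the real WMF production or Labs networks.

```
<?php
// Sketch of classifying a request IP as Internal / Labs / External by
// CIDR block. The ranges are placeholders for illustration only.
function ipInCidr( $ip, $cidr ) {
	list( $subnet, $bits ) = explode( '/', $cidr );
	$mask = -1 << ( 32 - (int)$bits );
	return ( ip2long( $ip ) & $mask ) === ( ip2long( $subnet ) & $mask );
}

function classifyIp( $ip, array $classes ) {
	foreach ( $classes as $label => $cidrs ) {
		foreach ( $cidrs as $cidr ) {
			if ( ipInCidr( $ip, $cidr ) ) {
				return $label;
			}
		}
	}
	return 'External'; // anywhere else
}

$classes = [
	'Internal' => [ '10.0.0.0/8' ],    // placeholder for WMF production hosts
	'Labs'     => [ '172.16.0.0/21' ], // placeholder for Labs instances
];
echo classifyIp( '172.16.1.5', $classes ), "\n"; // Labs
```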

It would definitely be useful to combine data from the Action API, RESTBase, and the Wikidata Query Service, if that makes technical sense at all. What I mean is that we don't really care which technology is being used; we care about who is using our APIs and what for.

PS: it looks like Reading-Infrastructure-Team-Backlog is taking this task officially?

Reading-Infrastructure-Team-Backlog will commit to helping get useful data from the Action API into Hadoop by publishing detailed per-request data to a Kafka topic. Actually building reports/dashboards from there will not be something we commit to. We would be glad to work with whoever builds those reports to tweak the data we send from MediaWiki into Hadoop.

bd808 claimed this task.Oct 1 2015, 3:19 PM

@Tnegrin has suggested that I act as product manager for this epic to help coordinate work between various teams and try to keep it from stalling out. I'll spend some time over the next week or so trying to make some useful subtasks and talking to various teams to see if they can provide assistance over the October-December time frame (WMF 2015/2016-Q2).

Qgil awarded a token.Oct 1 2015, 3:23 PM
Ottomata added a subscriber: Ottomata.EditedOct 2 2015, 6:00 PM

@bd808, I believe @EBernhardson is producing binary Avro to Kafka. At this time, I advise you to stick to JSON. Analytics is working on figuring out how to import the binary data into Hadoop properly, but it is not as easy as we thought it would be. EventBus will in the near term only support producing JSON.

Using an Avro schema is fine, as is using a JSON schema. If you do go with an Avro schema (which you are!), you should be able to push the Avro-JSON representation of your data instead of binary. I'm not sure how that works with the existing Avro+Monolog implementation, but we should make sure that it works.

The Avro encoder in PHP doesn't look like it directly supports JSON encoding, but I don't think it needs to either. We would just need to re-use the schema validation, then do a standard JSON encoding. It seems we might not need binary Avro on the PHP side at all then?
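
As a rough sketch of that idea, with a hand-rolled required-field check standing in for real Avro schema validation (the field names are hypothetical):

```
<?php
// Sketch: validate an event against the schema's field list, then emit
// standard JSON instead of binary Avro. The check below is a stand-in
// for real Avro schema validation; field names are hypothetical.
$schemaFields = [ 'ts', 'action', 'userAgent' ]; // taken from the Avro schema
$event = [ 'ts' => gmdate( 'c' ), 'action' => 'query', 'userAgent' => 'Foo/1.0' ];

$missing = array_diff( $schemaFields, array_keys( $event ) );
if ( $missing ) {
	throw new RuntimeException( 'Event missing fields: ' . implode( ', ', $missing ) );
}
echo json_encode( $event ), "\n";
```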

Spage updated the task description. (Show Details)Oct 6 2015, 8:24 PM
bd808 added a comment.Oct 8 2015, 7:37 PM

Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)

@Spage Do you envision needing this particular subset of the data over a long period of time or would something like the last 30 days or so suffice?

I haven't created a sub-task for it yet, but one of the things we are going to need to nail down for all of this is what the aggregate data tables that we roll information up into for historic reporting should look like. We won't be able to keep all of the data around forever, both because of the storage space needed and because of our data retention policies. I'm thinking that this particular use case is more near-term operational than long-term trend related.

Nuria added a subscriber: Nuria.Oct 8 2015, 9:16 PM

Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)

If you are thinking of HTTP errors, the data is present right now in the webrequest table (for the last couple of months), so you can get this information as of now. For application errors, you will need to wait until those are published.

dr0ptp4kt moved this task from Backlog to Current Quarter on the Reading-Admin board.
Addshore added a subscriber: Addshore.

If you are thinking of HTTP errors, the data is present right now in the webrequest table

The HTTP response to a failing API request like https://www.mediawiki.org/w/api.php?action=parse&page=Wrong is 200 OK. As T113672 says, the API response has a fantastic HTTP header:

MediaWiki-API-Error: missingtitle
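
That is easy to confirm from any client; a quick sketch, assuming the header is still emitted as described:

```
<?php
// Check the behavior described above: the HTTP status is 200 OK, but
// the MediaWiki-API-Error header carries the actual error code.
$url = 'https://www.mediawiki.org/w/api.php?action=parse&page=Wrong';
$headers = get_headers( $url, 1 );
echo $headers[0], "\n";                                  // HTTP/1.1 200 OK
echo $headers['MediaWiki-API-Error'] ?? '(none)', "\n";  // missingtitle
```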

Could we add this MediaWiki-API-Error header to the webrequest table? Should I file a separate bug?

Nuria added a comment.Oct 26 2015, 2:28 PM

Could we add this MediaWiki-API-Error header to the webrequest table? Should I file a separate bug?

We can add this value to the X-Analytics header, but it will be worth thinking about whether we want a global value to signal "health of API request". For example, does this header capture a 503?
I do not know much about API return codes, so to capture the precise information I think it will be best if you open a new ticket and assign it to the API developers (while tagging it with Analytics-Backlog so we also see it), and we can work together on it.

Could we add this MediaWiki-API-Error header to the webrequest table? Should I file a separate bug?

We can add this value to the X-Analytics header, but it will be worth thinking about whether we want a global value to signal "health of API request". For example, does this header capture a 503?

A 503 from the API means a PHP fatal or something so badly screwed up that exceptions aren't being caught. I have no idea what "it will be worth thinking about whether we want a global value to signal 'health of API request'" might mean.

I do not know much about API return codes, so to capture the precise information I think it will be best if you open a new ticket and assign it to the API developers (while tagging it with Analytics-Backlog so we also see it), and we can work together on it.

I see nothing relevant to the API developers (i.e. me) here that would require a new ticket.

Nuria added a comment.Oct 26 2015, 6:49 PM

I see nothing relevant to the API developers (i.e. me) here that would require a new ticket.

I think a ticket that is concise regarding the header will be clearer than this wide-scope ticket.

Now, if you want this done in the near term, you can ping us on #wikimedia-analytics and we can help you make the needed changes on Varnish so the header is part of the X-Analytics map. Otherwise the work can go into our backlog, but we will not get to it for a bit.

I see nothing relevant to the API developers (i.e. me) here that would require a new ticket.

I think a ticket that is concise regarding the header will be clearer than this wide-scope ticket.

That doesn't say why you suggested it be assigned to the API developers, or why you implied that the API developers would necessarily have anything to do to make it happen, with Analytics only needing to "see" it.

Nuria added a comment.Oct 26 2015, 8:55 PM

@Spage: added a (sub) task : https://phabricator.wikimedia.org/T116658 to keep things organized.

bd808 moved this task from To Do to In Dev/Progress on the User-bd808 board.Nov 4 2015, 5:24 AM
Ainali removed a subscriber: Ainali.Nov 13 2015, 9:50 PM
Milimetric moved this task from Incoming to Radar on the Analytics board.Jan 12 2016, 7:32 PM
Krinkle removed a subscriber: Krinkle.Mar 22 2016, 7:54 PM
Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptApr 20 2016, 12:52 AM
bd808 moved this task from In Dev/Progress to To Do on the User-bd808 board.Aug 1 2016, 11:56 PM
bd808 removed bd808 as the assignee of this task.Sep 30 2016, 10:14 PM

Unlicking this soggy old cookie.

Tgr moved this task from Backlog to Pending on the User-Tgr board.May 3 2017, 9:21 AM
dr0ptp4kt moved this task from Admin to Tracking on the Reading-Admin board.Jul 20 2017, 10:00 PM
Qgil removed a subscriber: Qgil.Aug 28 2018, 9:12 AM