Add Image: Track API performance
Open, High, Public

Description

There are no performance metrics for the Image Recommendation API, due to limitations of the WMCS infrastructure, so we should track response times on our side.
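As a rough illustration of tracking response times on the caller's side, the sketch below wraps a call and records its wall-clock duration under a metric name. It is written in Python for brevity and is not the GrowthExperiments implementation; the `timed` helper, the in-memory `timings_ms` store, the metric name, and `fake_api_get` are all hypothetical stand-ins.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")

# Hypothetical in-memory stand-in for a stats backend (e.g. a statsd client).
timings_ms: Dict[str, List[float]] = {}


def timed(metric: str, fn: Callable[[], T]) -> T:
    """Run fn and record how long it took, in milliseconds, under `metric`."""
    start = time.perf_counter()
    try:
        return fn()
    finally:
        # Recorded even if fn raises, so failed requests are timed too.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings_ms.setdefault(metric, []).append(elapsed_ms)


def fake_api_get() -> dict:
    """Hypothetical placeholder for the real HTTP request to the API."""
    time.sleep(0.01)
    return {"pages": []}


# The metric name is an assumption made for this example only.
result = timed("image_recommendation.get", fake_api_get)
```

Recording the timing in a `finally` block is a design choice: it keeps slow failures visible in the metric instead of only timing successful requests.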

Event Timeline

Restricted Application added a subscriber: Aklapper.
Tgr renamed this task from Add Image: track API performance to Add Image: Track API performance. Nov 3 2021, 5:17 AM
Tgr edited projects, added Growth-Team (Current Sprint); removed Growth-Team.
Tgr moved this task from Incoming to Ready for Development on the Growth-Team (Current Sprint) board.
kostajh triaged this task as High priority. Nov 3 2021, 1:44 PM

Change 738277 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[mediawiki/extensions/GrowthExperiments@master] Add Image: Track API performance

https://gerrit.wikimedia.org/r/738277

@kostajh Besides being called for the API get request, ServiceImageRecommendationProvider::processApiResponseData is also called from SubpageImageRecommendationProvider::createRecommendation (see code here). Do we need to track performance for this creation path as well?

Change 738277 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add Image: Track API performance

https://gerrit.wikimedia.org/r/738277

Moving back to "In progress" to add the dashboard, which we can do once the code is deployed to production wikis.

I added a new row, "Image recommendation service", and created 3 panels in the Grafana dashboard based on existing ones. All feedback on the naming and metrics aggregation is welcome. @Tgr you mentioned a way to aggregate both get and processApiResponseData. On the rate panel I added a "total" aggregation, but I'm not 100% sure how you were thinking of displaying this.

I admit I didn't quite think that through. For count/rate you can just add up the series, although in this case that doesn't make sense (get and processApiResponseData both happen once per request, so it's enough to chart one of them). For percentiles, every pixel of the chart is the 75th-percentile value for that stat within whatever time slice the pixel corresponds to, and theoretically I suppose there is no way to recover p75(get + processApiResponseData) from p75(get) and p75(processApiResponseData). But just doing sumSeries(XXX.*.p75) will probably be close enough for comfort.
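To make the percentile point concrete, here is a small Python sketch with made-up, deliberately contrived timings (not real measurements) in which the sum of the two p75 values differs from the p75 of the per-request totals. The `percentile` helper uses the nearest-rank definition, which is an assumption; the stats backend may compute percentiles differently, and how close the sum gets on real data depends on the actual distributions.

```python
import math
from typing import List


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Invented per-request timings in ms; index i is the same request in both lists.
get_ms = [100, 100, 100, 500]
process_ms = [500, 10, 10, 10]

# Per-request total time, which is what we would ideally take the percentile of.
total_ms = [g + p for g, p in zip(get_ms, process_ms)]

sum_of_p75 = percentile(get_ms, 75) + percentile(process_ms, 75)
p75_of_sum = percentile(total_ms, 75)

print(sum_of_p75)  # 110
print(p75_of_sum)  # 510
```

The gap is large here because the slow samples of the two stats fall on different requests. When slow get calls and slow processing tend to coincide, the two numbers move closer together.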

@Tgr I updated the panels. Maybe we should add some documentation in the Chore list checklist. What's a reasonable upper bound for the API requests?

What's a reasonable upper bound for the API requests?

Currently requests take a few hundred milliseconds. I'm not sure what counts as reasonable, but if that increases significantly, we should at least investigate.

The dashboard shows unrealistically small p99 values. I wonder if it counts time periods with no requests as zero, pulling the average down. Grafana can differentiate between zero and null values, although I usually have a hard time getting that part of the configuration right.
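The suspected effect can be sketched in Python with invented numbers. Here `None` marks a time slice with no requests, and the two helpers show how averaging over slices changes depending on whether empty slices count as zero or are skipped. This only illustrates the hypothesis above; it is not a description of how Grafana or the stats backend actually aggregates.

```python
from typing import List, Optional

# Invented p99 values in ms per time slice; None means no requests in that slice.
p99_per_slice: List[Optional[float]] = [400.0, None, None, 600.0]


def mean_nulls_as_zero(values: List[Optional[float]]) -> float:
    """Average where empty slices are counted as 0 ms."""
    return sum(v if v is not None else 0.0 for v in values) / len(values)


def mean_skipping_nulls(values: List[Optional[float]]) -> float:
    """Average over only the slices that actually had requests."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


print(mean_nulls_as_zero(p99_per_slice))   # 250.0
print(mean_skipping_nulls(p99_per_slice))  # 500.0
```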