Page MenuHomePhabricator

Impact module: Instrument loading times
Closed, ResolvedPublic

Description

Description
We have a dashboard that provides us information about loading times for the suggested edits module on Special:Homepage.

We will want to implement something similar for the Impact module, so that we can 1) understand the overall load time for desktop/mobile users per wiki and 2) catch performance related regressions after we've deployed the new module to production.

Acceptance Criteria

  1. Should be able to see time-to-interactive for the Impact module on a Grafana chart on the Special:Homepage grafana dashboard
  2. ...
Completion checklist

Functionality

  • The patches have been code reviewed and merged
  • The task passes its acceptance criteria

Engineering

  • There are existing and passing unit/integration tests
  • Tests for every involved patch should pass
  • Coverage for every involved project should have improved or stayed the same

Design & QA

  • If the task is UX/Design related: it must be reviewed and approved by the UX/Design team
  • Must be reviewed and approved by Quality Assurance.

Documentation

  • Related and updated documentation done where necessary

Event Timeline

kostajh renamed this task from Instrument impact module loading to Impact module: Instrument loading times.Jun 16 2022, 9:59 AM

We can also instrument UserImpactHandler for measuring API response times and UserImpactCompute service (for calculating impact data)

Change 856010 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] UserImpactLookup: Instrument getExpensiveUserImpact

https://gerrit.wikimedia.org/r/856010

Change 856011 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] UserImpactHandler: Instrument API response time

https://gerrit.wikimedia.org/r/856011

Change 856011 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] UserImpactHandler: Instrument API response time

https://gerrit.wikimedia.org/r/856011

Change 856010 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] UserImpactLookup: Instrument getExpensiveUserImpact

https://gerrit.wikimedia.org/r/856010

Change 858456 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] WIP: NewImpact: Instrument app loading

https://gerrit.wikimedia.org/r/858456

Things other than loading times that might be worth keeping track of:

  • Cache hit ratio - it's reflected in UserImpactHandler.run but still might help with diagnostics to track directly.
  • Maybe how often we run into the ping limiter in UserImpactHandler?
  • Job queue characteristics like average wait time? (Probably exists somewhere, just need to find it.) One potential performance complication could be when the job queue gets clogged up (I don't think we use a dedicated queue, maybe we should? not sure about the policies around that) so RefreshUserImpactJob writes get delayed. (Alternatively, maybe the refresh in the API handler should be a deferred.)
  • Reliability of PageViewInfo (which sadly does not have any monitoring of its own) - how often do we get truncated pageview data because of quota overruns?

Change 860016 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] UserImpactHandler: Instrument cache hit and misses

https://gerrit.wikimedia.org/r/860016

Things other than loading times that might be worth keeping track of:

  • Cache hit ratio - it's reflected in UserImpactHandler.run but still might help with diagnostics to track directly.
  • Maybe how often we run into the ping limiter in UserImpactHandler?

Those should be accounted for in https://gerrit.wikimedia.org/r/860016.

  • Job queue characteristics like average wait time? (Probably exists somewhere, just need to find it.) One potential performance complication could be when the job queue gets clogged up (I don't think we use a dedicated queue, maybe we should? not sure about the policies around that) so RefreshUserImpactJob writes get delayed.

I think we have that data here? https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1

(Alternatively, maybe the refresh in the API handler should be a deferred.)

That's probably a good idea, though, it used to be the case that we were not supposed to do writes in a GET request, even in a deferred, due to the multi-DC work. I think that changed, though?

  • Reliability of PageViewInfo (which sadly does not have any monitoring of its own) - how often do we get truncated pageview data because of quota overruns?

Maybe we should add that in PageViewInfo? In either case I think that should be a separate task, if we want to instrument this.

kostajh changed the task status from Open to In Progress.Nov 23 2022, 1:40 PM

Those should be accounted for in https://gerrit.wikimedia.org/r/860016.

Thanks!

I think we have that data here? https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1

Thanks for finding. Might be worth duplicating some of it IMO because individual Grafana metrics cannot be deeplinked and manually selecting the right job out of dozens is very time-consuming.

That's probably a good idea, though, it used to be the case that we were not supposed to do writes in a GET request, even in a deferred, due to the multi-DC work. I think that changed, though?

I think it's not strictly forbidden but discouraged, as deferreds are less time-sensitive but you still end up with a cross-datacenter DB connection. Plus we'd have to silence the logspam somehow.

Maybe we should add that in PageViewInfo?

It's hard to tell at that level if it's a problem. E.g. the action API wrapper around it will just stop when it encounters the first title with no data, and the client can use standard API continuation to get more data, so with most MediaWiki-specific API libraries the developer doesn't even need to pay extra attention and will get all the data eventually.

Change 860016 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] UserImpactHandler: Instrument cache hit and misses

https://gerrit.wikimedia.org/r/860016

That's probably a good idea, though, it used to be the case that we were not supposed to do writes in a GET request, even in a deferred, due to the multi-DC work. I think that changed, though?

I think it's not strictly forbidden but discouraged, as deferreds are less time-sensitive but you still end up with a cross-datacenter DB connection. Plus we'd have to silence the logspam somehow.

I filed T323747: NewImpact: Use POST request for fetching impact data for this.

Change 858456 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] NewImpact: Instrument app loading

https://gerrit.wikimedia.org/r/858456

Those should be accounted for in https://gerrit.wikimedia.org/r/860016.

Thanks!

I think we have that data here? https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1

Thanks for finding. Might be worth duplicating some of it IMO because individual Grafana metrics cannot be deeplinked and manually selecting the right job out of dozens is very time-consuming.

That's probably a good idea, though, it used to be the case that we were not supposed to do writes in a GET request, even in a deferred, due to the multi-DC work. I think that changed, though?

I think it's not strictly forbidden but discouraged, as deferreds are less time-sensitive but you still end up with a cross-datacenter DB connection. Plus we'd have to silence the logspam somehow.

Maybe we should add that in PageViewInfo?

It's hard to tell at that level if it's a problem. E.g. the action API wrapper around it will just stop when it encounters the first title with no data, and the client can use standard API continuation to get more data, so with most MediaWiki-specific API libraries the developer doesn't even need to pay extra attention and will get all the data eventually.

@Tgr I'm moving this to QA, but if you'd like, please file follow-up tasks for the job queue and PageViewInfo instrumentation.