Performance review of Real Time Preview
Closed, Resolved, Public

Description

This request is being filed in advance, honoring the timelines set forth at mw:Wikimedia Performance Team/Performance Review. Please note that this task will be updated as more code is written.

Community-Tech is currently developing a new feature called Real Time Preview (not to be confused with Live Preview). The basic functionality is to repeatedly request a preview from the action=parse API, just like Live Preview, but with requests debounced on keyboard input, as DiscussionTools does, and with some performance safeguards.
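As an illustration of the debouncing described above, here is a minimal sketch, not the extension's actual code. The names `debounce`, `Scheduler`, and `buildParseParams` are hypothetical, and the parameter set passed to action=parse is an assumption about what a preview request would need. The idea is that each keystroke cancels the pending timer, so only the text present when typing stops is sent.

```typescript
// Hypothetical sketch; names and structure are assumptions, not the real implementation.
type TimerHandle = unknown;

// The timer functions are injectable so the logic can be exercised without real delays.
interface Scheduler {
  set(callback: () => void, delayMs: number): TimerHandle;
  clear(handle: TimerHandle): void;
}

const realScheduler: Scheduler = {
  set: (callback, delayMs) => setTimeout(callback, delayMs),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

// Returns a wrapper that runs `fn` only after `delayMs` have passed without another call.
function debounce<Args extends unknown[]>(
  fn: (...args: Args) => void,
  delayMs: number,
  scheduler: Scheduler = realScheduler,
): (...args: Args) => void {
  let pending: TimerHandle | null = null;
  return (...args: Args): void => {
    if (pending !== null) {
      scheduler.clear(pending); // a newer keystroke supersedes the older one
    }
    pending = scheduler.set(() => {
      pending = null;
      fn(...args);
    }, delayMs);
  };
}

// Assumed request parameters for previewing unsaved wikitext via action=parse.
function buildParseParams(title: string, wikitext: string): Record<string, string> {
  return {
    action: "parse",
    format: "json",
    title: title,
    text: wikitext,
    contentmodel: "wikitext",
    pst: "1", // apply the pre-save transform before parsing
    prop: "text",
  };
}
```

In use, the editor's input handler would call something like `debounce((text: string) => send(buildParseParams(pageTitle, text)), 1500)`, where `send`, `pageTitle`, and the 1500 ms delay are placeholders.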

Preview environment

For now, this can be tested on Patch Demo at https://patchdemo.wmflabs.org/wikis/85ef14e849. It will soon be deployed to the Beta Cluster and Test Wikipedia.

You can test the feature by editing any article and clicking on the "Preview" button in the top-right of the editing window. The feature is only available to WikiEditor (the 2010 wikitext editor).

Which code to review

We spoke with a Performance engineer over email and were informed we didn't need to have all the code up for review yet. The performance safeguards that will be added soon (T302282) include:

  • Ensure there's no more than one in-flight request to the API at any time
  • Average the response time from the server over N requests (perhaps three), and if it is greater than, say, 10 seconds, disable Real Time Preview.
  • Don't load Real Time Preview at all for very large pages (say 300K bytes)
  • The debounce time (time after the user stops typing at which we make the action=parse request) is to be determined, but it will be set to something sensible that is certainly no less than a second or two.
  • These thresholds will have corresponding configuration settings and may be adjusted according to how the feature performs once it's tested in a production environment.
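
The safeguards listed above could be combined roughly as follows. This is a sketch under stated assumptions, not the code tracked in T302282: the class name `PreviewGuard`, its methods, and the `PreviewLimits` fields are all hypothetical, and the numbers used in the usage note are simply the tentative values from the list.

```typescript
// Hypothetical sketch of the safeguards; all names here are assumptions.
interface PreviewLimits {
  maxPageBytes: number; // pages larger than this never load the feature
  maxAverageMs: number; // disable once the rolling average exceeds this
  sampleSize: number;   // N responses to average over
}

class PreviewGuard {
  private limits: PreviewLimits;
  private inFlight = false;
  private disabled = false;
  private samples: number[] = [];

  constructor(limits: PreviewLimits) {
    this.limits = limits;
  }

  // Size check, made once when the editor loads.
  shouldLoad(pageBytes: number): boolean {
    return pageBytes <= this.limits.maxPageBytes;
  }

  // Returns true if a request may be sent now, and marks it as in flight.
  tryBegin(): boolean {
    if (this.disabled || this.inFlight) {
      return false;
    }
    this.inFlight = true;
    return true;
  }

  // Records how long the response took and disables the feature if the
  // average over the last `sampleSize` responses is too slow.
  finish(elapsedMs: number): void {
    this.inFlight = false;
    this.samples.push(elapsedMs);
    if (this.samples.length > this.limits.sampleSize) {
      this.samples.shift();
    }
    if (this.samples.length === this.limits.sampleSize) {
      const total = this.samples.reduce((sum, ms) => sum + ms, 0);
      if (total / this.samples.length > this.limits.maxAverageMs) {
        this.disabled = true;
      }
    }
  }

  isDisabled(): boolean {
    return this.disabled;
  }
}
```

With the tentative values from the list, this would be constructed as `new PreviewGuard({ maxPageBytes: 300000, maxAverageMs: 10000, sampleSize: 3 })`. Keeping the thresholds in one object mirrors the intent that each has a corresponding configuration setting.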

Performance assessment

  • What work has been done to ensure the best possible performance of the feature?
    • See T302282 for what is to be done
  • What are likely to be the weak areas (e.g. bottlenecks) of the code in terms of performance?
    • We are only hitting the action=parse API, so in theory without proper safeguards we could see performance bottlenecks from an increase in action=parse requests.
  • Are there potential optimisations that haven't been performed yet?
    • Yes, but they will be added soon. See T302282.
  • Please list which performance measurements are in place for the feature and/or what you've measured ad-hoc so far.
    • None yet; however, we plan to create a Grafana dashboard to monitor the performance of the action=parse endpoint. We invite suggestions on how best to monitor for any performance overhead. We can also add event logging to measure how many users have the feature disabled because of persistently slow requests to action=parse.

Event Timeline

Krinkle triaged this task as Medium priority.

Pencilled in for next quarter starting April 1st. Will be done by @aaron and myself.

Hey Performance Team, we have plans to release the MVP of the feature for feedback to a limited set of users we refer to as "pilot wikis".

These are wikis that we've reached out to (example of message we wrote to them). The feature will be released there as an opt-out Beta feature, meaning it will default to ON for people who use the 2010 wikitext editor. The list of the pilots is as follows:

  1. Huwiki
  2. Cawiki
  3. Plwiki
  4. Viwiki
  5. Fawiki
  6. Fiwiki
  7. Kowiki

We plan on doing this April 17th. We know the quarter begins on April 1st and we've implemented the safeguards as @MusikAnimal mentioned! Let us know if this causes any concern, @Krinkle. Thanks so much :D

Hey @NRodriguez, I see that you have done a great job doing the perf self-assessment, thanks for that.

We wonder if the Grafana dashboard mentioned in the description was created. We are collecting data on how teams are monitoring performance, and we'd like to know how your team is approaching it.

Krinkle subscribed.

For the record, there was good communication here throughout, and the described safeguards and implementation look great, possibly more conservative than needed, but certainly a very solid starting point. The only remaining thing I'd like us to do here is, as Larissa mentions, to have a look at the monitoring, to see how the feature actually works in production on the pilot wikis, and to run a few ad-hoc tests and report back on those.

Assigning to @aaron for that second part. Note, it might take three weeks before we get to it, given our offsite next week.

No issues with the backend use stand out to me. I'll leave the more frontend review to @Krinkle.

aaron claimed this task.

Closing this now, since both front-end and back-end sides have been looked at.

It would be good to keep an eye on https://grafana-rw.wikimedia.org/d/000000559/api-requests-breakdown?orgId=1&refresh=5m&var-metric=p50&var-module=parse

"We wonder if the Grafana dashboard mentioned in the description was created. We are collecting data on how teams are monitoring performance, we'd like to know how your team is approaching it."

I'm sorry to report we did not get around to creating the dashboard, but it sounds like the dashboard Aaron linked to (thank you!) is what we want? That covers request rate and execution time. Right now Realtime Preview is available on all wikis as a Beta feature. Once we graduate it from Beta, I believe the plan is to have it on by default for all users. Certainly when that time comes we will closely monitor the dashboard and take action accordingly. I have added this as a bullet point to our rollout task (T303961) so we don't forget.

@aaron @Krinkle While we have your attention, may I ask what you might consider a problematic spike in action=parse API requests/execution times? The request rate will of course go up after we enable it for all users, but I think that by itself is expected and probably okay. I'm guessing it's the execution times that deserve more attention? They shouldn't change much if all goes "well", because Realtime Preview will disable live reloads if action=parse is too slow. Is there anything else in particular we should be mindful of?

Thanks as always!

@MusikAnimal The distribution of parse requests itself wouldn't be a concern per se, unless we find it affecting other layers of infrastructure (e.g. overall appserver, or DB, or memc).

I'd say the spike might be reason for concern if we can correlate it to a negative shift in latency figures on Grafana: Appserver RED dashboard. Specifically, with cluster=api_appserver method=POST, I would look at the estimated mean and p75 latency (~79ms), and the percentage of reqs within the 100ms threshold (~81%).
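
To make the three figures mentioned above concrete, here is a small sketch of how they could be computed from a set of raw latency samples. It is only an illustration: the dashboard's own values are estimates and are not necessarily derived this way, the function name `summarizeLatencies` is hypothetical, and the nearest-rank method used for p75 is an assumption.

```typescript
// Hypothetical helper; not how the Grafana dashboard derives its estimates.
interface LatencySummary {
  meanMs: number;
  p75Ms: number;
  withinThresholdPct: number;
}

function summarizeLatencies(samplesMs: number[], thresholdMs: number = 100): LatencySummary {
  if (samplesMs.length === 0) {
    throw new Error("need at least one latency sample");
  }
  // Sort a copy so the caller's array is left untouched.
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const mean = sorted.reduce((sum, ms) => sum + ms, 0) / sorted.length;
  // Nearest-rank percentile: the smallest sample with at least 75% of samples at or below it.
  const rank = Math.ceil(0.75 * sorted.length);
  const p75 = sorted[rank - 1];
  // Share of requests that completed within the threshold (inclusive).
  const within = sorted.filter((ms) => ms <= thresholdMs).length;
  return {
    meanMs: mean,
    p75Ms: p75,
    withinThresholdPct: (100 * within) / sorted.length,
  };
}
```

Comparing such a summary from before and after a rollout would show the kind of negative shift described above: a higher mean or p75, or a lower share of requests within 100 ms.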