
[EPIC] Web Performance SLOs CI
Closed, DuplicatePublic

Description

We need to set performance regression alerts for a number of SLOs. We want to detect performance regressions at the time a patch is merged.

SLOs:

  • Time to First Byte:
    • Goal: <= 800ms on average hardware
    • Budget: < Goal on average hardware, or the highest time in the last two weeks
  • First Contentful Paint:
    • Goal: <= 1.8 seconds on average hardware
    • Budget: < Goal on average hardware, or the highest time in the last two weeks
  • Largest Contentful Paint:
    • Goal: <= 2.5 seconds on average hardware
    • Budget: < Goal on average hardware, or the highest time in the last two weeks
  • Total Blocking Time:
    • Goal: < 200ms on average mobile hardware
    • Budget: < Goal on average hardware, or the highest time in the last two weeks
  • Cumulative Layout Shift:
    • Goal: <= 0.1
    • Budget: < Goal, or the highest value in the last two weeks
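The goal/budget rules above can be sketched as a simple check. This is a hypothetical illustration, not an existing tool: the metric keys, the `SLO_GOALS` table, and the interpretation of "budget" as the larger of the goal and the worst value seen in the last two weeks are all assumptions drawn from the task description.

```python
# Hypothetical sketch of the SLO check described above.
# Metric names and the budget interpretation are illustrative assumptions.
SLO_GOALS = {
    "ttfb_ms": 800,    # Time to First Byte
    "fcp_ms": 1800,    # First Contentful Paint
    "lcp_ms": 2500,    # Largest Contentful Paint
    "tbt_ms": 200,     # Total Blocking Time (mobile)
    "cls": 0.1,        # Cumulative Layout Shift (unitless)
}

def budget_for(goal, two_week_history):
    """Budget: the goal, or the worst (highest) value observed in the
    last two weeks, whichever is higher."""
    worst = max(two_week_history, default=goal)
    return max(goal, worst)

def check_slos(measured, history):
    """Return (metric, measured, budget) tuples for every metric that
    exceeds its budget. Missing metrics are skipped, not flagged."""
    violations = []
    for metric, goal in SLO_GOALS.items():
        budget = budget_for(goal, history.get(metric, []))
        if measured.get(metric, 0) > budget:
            violations.append((metric, measured[metric], budget))
    return violations
```

For example, a 900ms TTFB with a two-week worst of 750ms would be flagged against the 800ms goal, while a CLS of 0.05 would pass.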

For more context and references, please check this document: https://docs.google.com/document/d/1sqRMjG8NqF7sLZoiNtcHI09oYAyeVSAx7LHSSCbqUL4/edit#heading=h.q4s6y6l2cibx

We need to:

  • Define the fault tolerance.
  • Capture the current number and compare it against the average of the past 2 weeks to 2 months.
  • Set up Gerrit CI steps to flag regressions.
  • Create email alerts, as a fail-safe, for regressions that appear on live deployments.
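The "compare against the average of the past 2 weeks to 2 months" step could be sketched as a rolling-baseline gate. This is only an illustration: the function name, the relative-tolerance approach, and the 10% default are assumptions (defining the actual fault tolerance is listed above as an open item).

```python
from statistics import mean

def flag_regression(current, baseline_values, tolerance=0.10):
    """Hypothetical CI gate: flag a regression when the current value
    exceeds the mean of the baseline window (past 2 weeks to 2 months
    of measurements) by more than the fault tolerance.

    tolerance=0.10 is a placeholder; the real value is yet to be defined.
    """
    if not baseline_values:
        return False  # no history yet, nothing to compare against
    baseline = mean(baseline_values)
    return current > baseline * (1 + tolerance)
```

A Gerrit CI step could fail the build (or post a warning comment) whenever this returns True for any tracked metric.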

Related Objects

Event Timeline

Jdlrobson renamed this task from Web Performance SLOs CI to [EPIC] Web Performance SLOs CI.Jun 22 2023, 5:53 PM
Jdlrobson assigned this task to Mabualruz.
Jdlrobson moved this task from Incoming to Epics/Goals on the Web-Team-Backlog-Archived board.

Recommended next steps:

  • Convert this to an epic around Perf SLOs (Mo)
  • Have an implementation planning meeting (can be in SHDT) for engs to break this down into smaller subtasks (tentatively: dashboards, documentation, CI/CD hooks, etc) (SHDT to discuss, Mo to drive)
  • Sync with Olga on this to ensure we're tracking this work going forward as a priority (Jon + Nat)
  • Reach out to Reuven in SRE - he's got a workshop we can do to formalize these, we should run through this (I can do this)

Chatted to @Mabualruz about this today. It's not 100% clear to me what the web team can do to make progress in this ticket. Have asked Mo to talk to the team in SHDT tomorrow to identify scope, to talk to Peter/SRE about what's possible and articulate what is achievable here so we can make progress.

Hi @Jdlrobson and @Mabualruz, I can help out, bounce ideas, and implement parts of it; just let me know.

Hi all, some discussion and updates on this are in the subtask https://phabricator.wikimedia.org/T345997.

@Peter Thanks for the offer, I may well need your help with:

I will keep you posted.

Hi @Mabualruz

especially data sources for the graphs.

FYI I've moved the dashboards I'm responsible for to https://grafana.wikimedia.org/dashboards/f/244t5tWSk/quality-and-testing-engineering-qte to try to clean up the old performance team folder. At the same time I added a row on each dashboard called Meta that holds information on data sources and update frequency. Please check whether that is enough or if I should add some more metadata.

CI pipelines and automation.

This should run as a project and I can be a part of that and help out. I think there's quite a lot of work that needs to be done, and I think as a first step maybe we could tune some of the alarms that the performance team used to have so the web team can use them? Then we could take on the task of setting up the CI.

Today we have Fresnel running in CI, and I think that works well for asset size changes. For acting on changes in web vitals/performance metrics we need some more work. There are a couple of things that make these metrics more complicated:

  • To get reliable metrics we need real articles with many different kinds of content when we test in CI. These metrics are heavily dependent on the content we actually show; that's why we have historically been testing in production.
  • The machines that run the actual tests should probably not be cloud machines. These tests are highly dependent on how much CPU power the instance has at the time the tests run, so in the best-case scenario the tests should run on a standalone server (or on mobile phones for mobile web). In T285203 I'm opening up for us to actually do that (as long as the server has access to a deployed version of MediaWiki). But before we do anything like that, we should run a test and see what kind of variance we get over time if we ran in our current CI setup. For example, GitLab used to run sitespeed.io in their CI setup but only collected scores instead of performance metrics, and I think the reason was that performance metrics are more demanding on the infrastructure.
  • I'm also thinking about how fast feedback we need from the tests. I think Mozilla does 21 runs for each URL they test, and that takes some time. We had a session with Gregory from the Mozilla performance team who demoed their setup; I'll check if I can find that video and share it with you. I think it can be some inspiration.
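The repeated-runs idea above (e.g. Mozilla's 21 runs per URL) is the usual way to cope with noisy CI machines: report the median across runs and use the spread as a noise indicator. A minimal sketch, with hypothetical names; the coefficient-of-variation check is my assumption for the "see what kind of variance we get" experiment, not an existing pipeline step.

```python
from statistics import mean, median, pstdev

def summarize_runs(values):
    """Summarize repeated runs of the same URL.

    - median: the reported metric, so a single noisy run (CPU
      contention, GC pause) doesn't skew the result.
    - cv: coefficient of variation, a rough indicator of how noisy
      the environment is; a high cv suggests the CI machines are
      too unstable for reliable performance metrics.
    """
    if not values:
        raise ValueError("need at least one run")
    m = mean(values)
    return {
        "median": median(values),
        "cv": pstdev(values) / m if m else 0.0,
    }
```

The run count is a trade-off between feedback speed and stability: more runs narrow the spread but lengthen the CI job.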

I'm thinking maybe we could sync the alerts that you are using today with the ones we had in the performance team, to get a more holistic view of the metrics and better alerts? We could combine the synthetic tests, the data from our real users, and the data we can collect from the Chrome User Experience Report as a first step for the metrics you are interested in, and then take on the job of setting up the CI.

Hey @Peter,

FYI I've moved the dashboards I'm responsible for to https://grafana.wikimedia.org/dashboards/f/244t5tWSk/quality-and-testing-engineering-qte to try to clean up the old performance team folder. At the same time I added a row on each dashboard called Meta that holds information on data sources and update frequency. Please check whether that is enough or if I should add some more metadata.

Thanks, the new segregation and the added information help a lot.

This should run as a project and I can be a part of that and help out. I think there's quite a lot of work that needs to be done, and I think as a first step maybe we could tune some of the alarms that the performance team used to have so the web team can use them? Then we could take on the task of setting up the CI.

Let me check the calendar; maybe we can do another recorded session to discuss that this week or next. The outcome should be a set of tasks to take on gradually to achieve that goal, and to check what is needed for it.

Capture the current number and compare it against the average of the past 2 weeks to 2 months.
Set up Gerrit CI steps to flag regressions.

With some tweaks, these are mostly the alerts that already exist, except that those run on production instead of in CI. The problems with finding regressions in CI at the moment are:

  • Most performance regressions need real content to be visible. The setup in CI runs only MediaWiki. For example, looking at group 0 wikis like mediawiki.org, it's harder to see regressions than on en.wikipedia.org.
  • The physical machines that run our performance tests on production have been carefully tuned to minimise the noise from the machine itself (for example, the CPU speed is the same all the time). That tuning makes it easier to see front-end regressions. In CI, the tests would at the moment run on cloud machines that do not have that tuning.

To fix these issues we would need more work to get real content into CI, plus more machines to actually run the tests, and I'm thinking it's not worth it. It would be better to trim the current alerts so they match the metrics that you want to use.

@Jdlrobson I'm thinking I can set it up together with the web team, but that you should own the alerts, or what do you think? Setting up the alerts is one thing, but who should act on them when they fire is another.

In the (really) old days with the performance team, I think it was broken in the sense that the performance team had the alerts, acted on the alerts, and fixed the alerts. Later, we did a first diagnosis to try to pinpoint what was causing the problem, and then got the right people to fix it. I'm thinking the web team gets a lot of this for free, because you're in a position to know what's being developed and when things are pushed, and you have better insight into what can cause the alerts to fire?

I'm thinking like this: the actual setup of the synthetic tests (the servers, the software, making sure metrics are collected and stored in a time series database) is QTE's responsibility. If the testing stops or the metrics are unstable, it's our responsibility to work on that.

For the alerts, we could do those, but I think it's valuable if we do it together, so we all understand how they work and can tweak them.

For acting on potential regressions, I'm thinking the web team is the most capable of doing that. But it could also be the team that is responsible for the Wikipedia user experience (which team is that?).

I'm open to suggestions, let me know what you think @Jdlrobson and I'll do what I can to help out!

For content perhaps https://m.mediawiki.org/wiki/Extension:MobileFrontendContentProvider might provide a solution?

I agree that the web team should own and monitor these - I am just not sure how we get them set up.

Is this a case of just setting up a 30m-1hr session to set the alerts up, or is there more to this? (Asking because the task suggests it's an epic, i.e. lots of work.)

I would say a couple of hours to get familiar with the alerts. I'll support as much as I can.