pt.wiki latency P95 + P99 in apps pre- and post-disabling of pregeneration in PCS
Closed, Resolved, Public

Description

Whilst we have been measuring the average latency for pt.wiki we want to know what the impact is for the worst 5% and worst 1% of traffic.

Could we get a comparison of the 95th and 99th percentiles before and after removing pregeneration from pt.wiki, for both Android and iPhone?

  • Only Portuguese selected as language
  • Broken down by country

Event Timeline

Data

Data Sources:
Android: event.android_app_session
iOS: event.MobileWikiAppiOSSessions
Note: Android data only begins on 2023-02-22, so the iOS data has been trimmed to the same start date.
Data points are average latency in ms per user session, by OS and user country. Android data is session_data.page_load_latency_ms and iOS data is event.page_load_latency_average.

Note that these are not huge datasets. Available session latency events per group:

Group              Event Count
Android Portugal   1496
Android Brazil     9670
iOS Portugal       1437
iOS Brazil         9465
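As a rough illustration of how per-session latency values like those described above can be reduced to 50th, 95th and 99th percentiles, here is a minimal sketch. The ticket does not state which percentile method or aggregation window was used, so the linear-interpolation method and the helper names `percentile` and `latency_summary` are assumptions made for illustration only.

```python
import math


def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation.

    Assumption: linear interpolation between closest ranks; the ticket does
    not say which percentile definition the analysis used.
    """
    if len(values) == 0:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted data.
    rank = (len(ordered) - 1) * p / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return float(ordered[lower])
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def latency_summary(latencies_ms):
    """Summarise a list of per-session latencies (ms) as p50/p95/p99."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }
```

With only around 1,400 to 1,500 events per OS in Portugal, the 99th percentile rests on a handful of sessions, so the tail figures are more sensitive to outliers than the median.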

Here are the results for Portugal:

iOS

50th 95th 99th
Average before 354.1153846 923.67 2257.36
Average after 476.1052632 1420.97 3530.5185
Percentage increase 34.44918912 53.83957474 56.40033047
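The "Percentage increase" rows in these tables are consistent with the usual relative change, (after - before) / before × 100. A small sketch that reproduces the iOS Portugal averages from the table above; the function name `pct_increase` is a hypothetical helper, not something from the original analysis.

```python
def pct_increase(before, after):
    """Relative change from `before` to `after`, expressed in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# (before, after) average latencies in ms for iOS in Portugal, from the table.
ios_portugal_avg = {
    "50th": (354.1153846, 476.1052632),
    "95th": (923.67, 1420.97),
    "99th": (2257.36, 3530.5185),
}

for name, (before, after) in ios_portugal_avg.items():
    print(f"{name}: {pct_increase(before, after):.2f}%")
```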

50th 95th 99th
Peaks before 413 1939.15 4385.68
Peaks after 575.5 3671.75 7950.6
Percentage increase 39.34624697 89.34842586 81.2854563

Screenshot 2023-03-31 at 19.01.56.png (584×929 px, 134 KB)

Screenshot 2023-03-31 at 19.14.34.png (632×1 px, 68 KB)

Android

50th 95th 99th
Average before 446.5384615 848.24 1271.092
Average after 545.6315789 1256.705 2334.494
Percentage increase 22.1913958 48.15441385 83.66050608

50th 95th 99th
Peaks before 547 1024.5 1661.75
Peaks after 681 2179.9 8059.85
Percentage increase 24.49725777 112.7769644 385.0218144

Screenshot 2023-03-31 at 19.02.34.png (465×742 px, 93 KB)

Screenshot 2023-03-31 at 19.14.54.png (595×964 px, 73 KB)

iOS data expanded to include events from 2023-01-01 to present: Data

Portugal is in our top 50 countries for pageviews and installs on Android and iOS. If we can't decrease the latency here within two weeks, we need to roll back the changes that were made. We also shouldn't make these changes for other languages until we can decrease these numbers.

Using iOS data which stretches back to January to get a better picture:

Brazil

			50th		95th		99th
Average before		338.2878788	1017.234091	2308.33
Average after		474.2380952	1384.242857	2832.429524
Percentage increase	40.18772914	36.07908637	22.7047053

Portugal

			50th		95th		99th
Average before		396.6893939	879.5757576	1844.079091
Average after		503.8571429	1231.847619	2303.255238
Percentage increase	27.01553169	40.0502013	24.90002459

Thanks for the update @JTannerWMF @SNowick_WMF @Seddon.

Portugal is in our top 50 countries for pageviews and installs on Android and iOS. If we can't decrease the latency here within two weeks, we need to roll back the changes that were made. We also shouldn't make these changes for other languages until we can decrease these numbers.

We already have some changes in flight to improve caching efficiency on the edge. I would prefer to give them a bit more time (two weeks sounds reasonable) to see how they perform. If the fixes don't give us good results, we will rethink our caching strategy.

After two weeks of work on improving caching efficiency under production traffic, it looks like the edge cache hit ratio we managed to achieve (up from ~10% to ~30%) wasn't good enough to bring latency to parity between the previous and the current state (with and without storage at the RESTBase level).
I am planning to revert the changes so that PCS uses pregenerated content again while we work towards finding a solution.

I've updated the relevant ticket with a lot of details on what we found during the time PCS storage was disabled:
https://phabricator.wikimedia.org/T314770#8776938

I think it can be a useful complement to the client-side analysis done in this ticket, to better understand how we move forward.

Change 908269 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/restbase/deploy@master] pcs: Re-enable storage for PCS endpoints

https://gerrit.wikimedia.org/r/908269

We are going to monitor whether re-enabling storage brings latency back to its previous levels.

Change 908269 merged by Jgiannelos:

[mediawiki/services/restbase/deploy@master] pcs: Re-enable storage for PCS endpoints

https://gerrit.wikimedia.org/r/908269

The changes are now reverted in prod, and PCS is using pregenerated content again. I will keep an eye on it over the next few days to see if the numbers return to the previous levels.

Latency looks like it's back to the previous numbers. Is there anything left for this ticket or should we go ahead and close it?