Most of the issues have been tackled on staging, and casual browsing doesn't show any rendering issues, so we need to run difftesting between current prod and staging to check how many inconsistencies there are between the old and the new version.
Description
Details
| Status | Assigned | Task |
|---|---|---|
| Resolved | akosiaris | T198901 Migrate production services to kubernetes using the pipeline |
| Resolved | None | T321959 Tech Wishes - Maps service infrastructure deprecations |
| Resolved | elukey | T216826 Move Kartotherian to Kubernetes |
| Resolved | Jgiannelos | T384530 Difftesting between staging and production |
Event Timeline
- I created a dataset of Kartotherian URLs by parsing Wikipedia articles that have Kartotherian references, from {en,de,fa,zh,ja,ru}wiki
- Shuffled and sampled (stratified) 100 articles from each wiki (600 articles in total)
- Fetched the Kartotherian snapshot URLs from current prod (from the maps nodes) and from staging
- Calculated the SSIM (structural similarity index) of the two versions
- Exported the diff images to be able to inspect what's happening
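The sampling and similarity steps above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the script used for this task: the function names are hypothetical, images are assumed to be already decoded into flat lists of grayscale pixel values, and `global_ssim` computes SSIM over a single window covering the whole image, which is a simplification of the windowed SSIM that image libraries usually compute.

```python
import random


def stratified_sample(urls_by_wiki, per_wiki=100, seed=0):
    """Shuffle each wiki's URL list and keep up to `per_wiki` entries.

    `urls_by_wiki` maps a wiki name (e.g. "enwiki") to a list of URLs.
    Hypothetical helper, not taken from the original test script.
    """
    rng = random.Random(seed)
    sample = {}
    for wiki, urls in urls_by_wiki.items():
        shuffled = list(urls)
        rng.shuffle(shuffled)
        sample[wiki] = shuffled[:per_wiki]
    return sample


def global_ssim(img_a, img_b, dynamic_range=255.0):
    """Single-window SSIM of two equal-sized grayscale images.

    Both images are flat lists of pixel values. Assumption: one global
    window instead of the usual sliding window, so values will differ
    from what a library implementation reports.
    """
    if not img_a or len(img_a) != len(img_b):
        raise ValueError("images must be non-empty and of equal size")
    n = len(img_a)
    mean_a = sum(img_a) / n
    mean_b = sum(img_b) / n
    var_a = sum((p - mean_a) ** 2 for p in img_a) / n
    var_b = sum((p - mean_b) ** 2 for p in img_b) / n
    cov = sum((pa - mean_a) * (pb - mean_b) for pa, pb in zip(img_a, img_b)) / n
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator
```

Identical images score 1 (up to floating-point error), and the score drops as the two renders diverge, which is how the quantile tables in this task should be read.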
From a quick look at the results:
results["ssim"].quantile([0, 0.25, 0.5, 0.75, 1.0])
| quantile | ssim |
| 0 | 0.429146 |
| 0.25 | 0.938396 |
| 0.5 | 0.977298 |
| 0.75 | 0.998523 |
| 1 | 1 |
This means that we do have some inconsistencies between staging and prod, but at a very high level 75% of the sample has an SSIM above 0.93. On the bright side, beyond the similarity numbers this test is a very good smoke test to see if something is wrong in the upgrade/migration, and things look OK overall: no errors were raised and, except for some transient issues, most of the requests returned a 200 status code.
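To make the reading of the quantile table concrete: the 0.25 quantile is the value below which roughly a quarter of the scores fall, so about 75% of the sample sits at or above it. A small stdlib-only sketch, using linear interpolation between order statistics (the pandas default) and made-up scores rather than the real results:

```python
import math


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def share_at_or_above(values, threshold):
    """Fraction of values that are greater than or equal to `threshold`."""
    return sum(v >= threshold for v in values) / len(values)


# Made-up SSIM scores for illustration only, not the real test results.
scores = [0.43, 0.94, 0.95, 0.98, 0.99, 1.0, 1.0, 1.0]
q25 = quantile(scores, 0.25)
print(q25, share_at_or_above(scores, q25))
```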
Change #1113816 had a related patch set uploaded (by Elukey; author: Elukey):
[mediawiki/services/kartotherian@master] blubber: add fonts-noto and fonts dejavu to the prod variant
From a similar run, this time A/B testing staging (eqiad) against prod (eqiad), since the previous test was targeting prod in codfw, the results are:
| quantile | ssim |
|-----:|---------:|
| 0 | 0.799977 |
| 0.25 | 0.969635 |
| 0.5 | 0.993923 |
| 0.75 | 0.999367 |
| 1 | 1 |
Change #1113816 merged by jenkins-bot:
[mediawiki/services/kartotherian@master] blubber: add fonts-noto and fonts dejavu to the prod variant
Change #1113835 had a related patch set uploaded (by Elukey; author: Elukey):
[mediawiki/services/kartotherian@master] blubber: add more fonts packages to close the gap with prod
Change #1113835 abandoned by Elukey:
[mediawiki/services/kartotherian@master] blubber: add more fonts packages to close the gap with prod
Change #1113842 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] services: bump kartotherian's Docker image
Change #1113842 merged by Elukey:
[operations/deployment-charts@master] services: bump kartotherian's Docker image
Change #1114420 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] services: update Kartotherian's docker image
Change #1114420 merged by Elukey:
[operations/deployment-charts@master] services: update Kartotherian's docker image
Latest diff test run:
| quantile | ssim |
| 0.25 | 0.99175 |
| 0.5 | 0.998551 |
| 0.75 | 1 |
| 0.9 | 1 |
| 0.95 | 1 |
| 0.99 | 1 |
Looks much better after the latest patches
Change #1115049 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] services: set the Tegola's cluster local endpoint for Kartotherian
Change #1115049 merged by Elukey:
[operations/deployment-charts@master] services: set the Tegola's cluster local endpoint for Kartotherian
Latest difftesting after fixing localization
| quantile | ssim |
| 0.1 | 0.983166 |
| 0.2 | 0.992429 |
| 0.25 | 0.993921 |
| 0.5 | 0.998057 |
| 0.75 | 0.999939 |
| 0.9 | 1 |
| 0.95 | 1 |
| 0.99 | 1 |
@elukey I double-checked the results and the issue is fixed. I am taking a look at the diffs that still show some inconsistencies, but so far it's nothing to worry about.
Difftesting between current prod (bare metal) and k8s prod deployment (eqiad):
| quantile | ssim |
| 0.1 | 0.990876 |
| 0.2 | 0.994956 |
| 0.25 | 0.995939 |
| 0.5 | 0.999625 |
| 0.75 | 1 |
| 0.9 | 1 |
| 0.95 | 1 |
| 0.99 | 1 |
Quick back-of-the-napkin calculation of the difference in response latency between A and B.
A: kartotherian prod (maps1009)
B: kartotherian prod k8s (wikikube worker)
| quantile | Latency difference between A and B (%) |
| 0.1 | -29.5699 |
| 0.2 | -20.8101 |
| 0.25 | -19.2296 |
| 0.5 | -5.18895 |
| 0.75 | 6.43051 |
| 0.9 | 16.1314 |
| 0.95 | 26.5609 |
| 0.99 | 127.555 |
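The exact formula behind the percentages above is not shown in this task. One plausible reading, stated here purely as an assumption, is the per-request difference relative to A, so that negative values mean B (k8s) was faster:

```python
def pct_latency_diff(elapsed_a, elapsed_b):
    """Latency of B relative to A, in percent.

    Assumption: 100 * (B - A) / A. The original script may have used a
    different baseline; this helper is hypothetical.
    """
    if elapsed_a <= 0:
        raise ValueError("elapsed_a must be positive")
    return 100.0 * (elapsed_b - elapsed_a) / elapsed_a
```

Under this reading, a request that took 2.0 s on A and 1.0 s on B gives -50, and one that took 1.0 s on A and 2.0 s on B gives 100. The asymmetry is one reason a relative figure can look dramatic in the tail, where A happened to be fast.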
Change #1115420 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] services: bump kartotherian's allowed millicores to 5k
Change #1115420 merged by Elukey:
[operations/deployment-charts@master] services: bump kartotherian's allowed millicores to 5k
After some back and forth with @elukey and increasing the CPU resources in the kartotherian deployment charts, here are some numbers that are a bit more useful.
results["diff_latency_ms"] = 1000 * (results["elapsed_b"] - results["elapsed_a"])
quantiles = results["diff_latency_ms"].quantile([0.1, 0.2, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
print(quantiles.to_markdown())
A: kartotherian in current bare metal prod
B: kartotherian in prod k8s pod
| quantile | Difference in latency (in ms) |
| 0.1 | -619.083 |
| 0.2 | -343.267 |
| 0.25 | -263.88 |
| 0.5 | -52.129 |
| 0.75 | 49.224 |
| 0.9 | 135.147 |
| 0.95 | 195.49 |
| 0.99 | 680.492 |
The tests run ~1000 Kartographer URLs with a concurrency of 16 requests in parallel.
Also, here is a histogram of the difference in latency:
After testing the outliers on the higher end, it looks like something is going wrong with geoshapes rendering (timeout?). Given how problematic geoshapes have historically been, it doesn't look like something we should worry about.
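For reference, an A/B timing run like the one described above (a list of URLs, 16 in flight at a time) could be structured along these lines. This is only a sketch: `ab_latency`, `time_call` and the injected `fetch_a` / `fetch_b` callables are hypothetical names, and the real script may measure elapsed time differently (for example from the HTTP client's own timing).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_call(fetch, path):
    """Return the wall-clock seconds taken by fetch(path)."""
    start = time.perf_counter()
    fetch(path)
    return time.perf_counter() - start


def ab_latency(paths, fetch_a, fetch_b, concurrency=16):
    """Time every path against backends A and B, `concurrency` paths at a time.

    `fetch_a` and `fetch_b` are caller-supplied callables that request the
    path from each backend. Results keep the order of `paths`.
    """
    def measure(path):
        elapsed_a = time_call(fetch_a, path)
        elapsed_b = time_call(fetch_b, path)
        return {
            "path": path,
            "elapsed_a": elapsed_a,
            "elapsed_b": elapsed_b,
            # Positive means B was slower than A for this path.
            "diff_latency_ms": 1000 * (elapsed_b - elapsed_a),
        }

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(measure, paths))
```

Passing the fetchers in keeps the timing logic independent of the HTTP client and lets the harness be exercised with stubs. Note that timing A and B back to back for the same path means B may benefit from caches warmed by A, so the order is worth randomizing in a real run.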
Change #1116815 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):
[mediawiki/services/kartotherian@master] Make outgoing requests service mesh aware
Change #1116815 merged by jenkins-bot:
[mediawiki/services/kartotherian@master] Make outgoing requests service mesh aware
Change #1116833 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] kartotherian: update Docker image and geoshapes yaml config
Change #1116833 merged by Elukey:
[operations/deployment-charts@master] kartotherian: update Docker image and geoshapes yaml config
Change #1116881 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):
[mediawiki/services/kartotherian@master] Fix handling of getJSON with service mesh endpoints
Change #1116881 merged by jenkins-bot:
[mediawiki/services/kartotherian@master] Fix handling of getJSON with service mesh endpoints
A: kartotherian in current bare metal prod
B: kartotherian in prod k8s pod
Latest difftesting run after fixing hanging connections of geoshapes:
| quantile | ssim |
| 0.05 | 0.987757 |
| 0.1 | 0.993571 |
| 0.2 | 0.997578 |
| 0.25 | 0.998374 |
| 0.5 | 1 |
| 0.75 | 1 |
| 0.9 | 1 |
| 0.95 | 1 |
| 0.99 | 1 |
From a quick look at the diffs on the lower end of similarity, it's mostly font differences: we use new fonts, which are an improvement overall.
Regarding latency:
results["diff_latency"] = 1000 * (results["elapsed_b"] - results["elapsed_a"])
quantiles = results.diff_latency.quantile([0.1, 0.2, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
print(quantiles.to_markdown())
| quantile | diff_latency (ms) |
| 0.1 | -184.014 |
| 0.2 | -86.6914 |
| 0.25 | -65.8558 |
| 0.5 | 9.0725 |
| 0.75 | 73.6645 |
| 0.9 | 137.611 |
| 0.95 | 189.539 |
| 0.99 | 710.617 |
And here is the histogram of the latency change.
Here are the latency quantiles in ms for each A/B test run.
So overall there is no big difference in latency in the k8s deployment.
I think it's pretty safe to continue with the migration. Closing this ticket for now. We can run the tests again in the future if needed.



