User Details
- User Since
- Aug 14 2018, 10:50 AM (202 w, 2 d)
- Availability
- Available
- IRC Nick
- effie
- LDAP User
- Effie Mouzeli
- MediaWiki User
- EMouzeli (WMF) [ Global Accounts ]
Oct 27 2021
Production URL testing (1,929,416 URLs) results are at https://people.wikimedia.org/~akosiaris/prod_urls/. Findings for c=20, c=30, and c=40 are consistent with what we have seen so far.
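For reference, a rough sketch (not the actual harness we used, which isn't described here) of replaying a URL list at a fixed concurrency and printing latency percentiles could look like the following; `prod_urls.txt` is a placeholder file name:

```python
# Illustrative sketch only: replay a URL list at concurrency c and report
# latency percentiles. The input file name is a placeholder.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str) -> float:
    """Fetch one URL and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    with urlopen(url, timeout=30) as resp:
        resp.read()
    return time.monotonic() - start

def run(urls: list[str], concurrency: int) -> None:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(fetch, urls))
    q = statistics.quantiles(latencies, n=100)  # 99 cut points
    print(f"c={concurrency} p50={q[49]:.3f}s p90={q[89]:.3f}s p99={q[98]:.3f}s")

if __name__ == "__main__":
    with open("prod_urls.txt") as fh:  # hypothetical input file
        urls = [line.strip() for line in fh if line.strip()]
    for c in (20, 30, 40):
        run(urls, c)
```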
Oct 26 2021
Parsoid testing: the original images can be found at https://people.wikimedia.org/~jiji/benchmarks-parsoid/, and our findings are similar to our previous tests. Bare metal performs better at low concurrencies, while k8s performs better at c=15 and up, though its >p90 latencies are not always great.
Oct 18 2021
Which brings up our usual issue again: to cache slot or not to cache slot? If we don't want to slot the caches, we could consider following what we are planning to do in T290536.
Oct 15 2021
We are not using proxies anymore; the TKOs we see every now and then could be related to T291385. There is not much we can do, so I'm closing this.
Since we have no mcrouter proxies and won't have any scap proxies in the future, I'm closing this.
Oct 13 2021
@Ladsgroup is this work still in progress or abandoned?
Oct 12 2021
@Legoktm we may debdeploy scap everywhere, and then for whatever reason need to push change Y fast due to issue X. If scap fails everywhere because of a bug we missed, we have a problem where we first need to downgrade scap and then rerun it. In my opinion, we should keep having scap sit on the canaries for 1 day, to save us from a potential scenario like this. To my knowledge, scap's test coverage is rather low (I admit I have not read the scap code for quite some time); if this is still the case, it gives us one more reason to be a little more careful with its rollout.
@Legoktm is working on a cookbook to speed up packaging of scap: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/727605. The rollout process has to stay as it is, though (upgrade on canaries first, then roll out to all hosts after 1-2 days).
Oct 8 2021
After the last tuning (APCu + memory limits), the results were more promising:
I think what we are missing here is how to get Prometheus metrics strictly for the canary deployment. I confess I have not dug deeper into this.
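As a hedged sketch only (I have not actually wired this up): if the canary pods carried a distinguishing label, the Prometheus HTTP API would let us filter on it. The Prometheus host, metric name, and label selector below are made-up placeholders, not what we actually export:

```python
# Hedged sketch: query the Prometheus HTTP API for a metric restricted to
# canary pods. The Prometheus URL, metric name, and label values are
# assumptions for illustration, not our actual setup.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS = "http://prometheus.example.internal:9090"  # hypothetical host
query = 'rate(php_request_duration_seconds_count{deployment="mediawiki-canary"}[5m])'

url = f"{PROMETHEUS}/api/v1/query?" + urlencode({"query": query})
with urlopen(url, timeout=10) as resp:
    data = json.load(resp)

# The standard response shape is {"status": ..., "data": {"result": [...]}}
for series in data["data"]["result"]:
    print(series["metric"], series["value"])
```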
Oct 7 2021
Running some tests (c=60, ~1.9M URLs) against the mwdebug services, we found 2 issues:
Tegola is running on Kubernetes, and Maps mirrored 100% of production traffic with no SRE-swift-storage issues. 🎉
Oct 4 2021
I ran an initial test with some thousands of production URLs. It appears that we are about to hit max_accelerated_files (currently 7963 × 12 pods = 95556). Looking at the same value on our production servers, 16229 is a possible value to set before moving forward. We will see if we need to bump opcache too.
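For context, PHP rounds opcache.max_accelerated_files up to the next value in its internal prime table (..., 3907, 7963, 16229, 32531, ...), which is where 7963 and 16229 come from. A very rough, purely illustrative way to sanity-check a candidate value is to count the PHP files a single pod would hand to opcache; the install path below is an assumption:

```python
# Rough sanity check, illustrative only: count the .php files a single pod's
# MediaWiki tree would feed into opcache and compare against the configured
# opcache.max_accelerated_files. The install path is an assumption.
import pathlib

MEDIAWIKI_ROOT = pathlib.Path("/srv/mediawiki")   # hypothetical path
CONFIGURED_LIMIT = 16229                          # candidate value from prod

php_files = sum(1 for _ in MEDIAWIKI_ROOT.rglob("*.php"))
print(f"{php_files} PHP files found under {MEDIAWIKI_ROOT}")
print("within limit" if php_files < CONFIGURED_LIMIT else "limit too small, bump it")
```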
@Joe did so, thanks.
Sep 30 2021
@dancy it would be lovely if we could speed this up; right now we have deploy1002 and maps* on version 3.17.1, and the rest on version 4.0.0.
Sep 28 2021
@Ladsgroup ran into this error:
Sep 22 2021
Thank you @ssastry, I updated the task description to include them.
Sep 21 2021
Thanks @dancy, I will try to get it done this week with @Arnoldokoth.
Sep 20 2021
@ssastry we have done some benchmarks, but none of those used Parsoid URLs; it would be great if you could provide a couple of Parsoid URLs you'd like us to test.
We'll first roll out on our canaries and 5 Parsoid servers, and continue with the full rollout tomorrow.
Sep 16 2021
The same thing happened with 1.2 MB files; I think the problem is with thumbnails of files of a certain size and up.
Last set of benchmarks of Round 1: we added a run with 6 pods × 8 workers, with no Tideways installed:
Since Thumbor is being discussed here, I would like to point out a few things about Thumbor's situation and its infrastructure:
Sep 14 2021
Last round of URLs, same configuration, with the addition of a couple more requests: gerrit: 720061, where we set y=0. We get a better idea of how marginal the differences are at low concurrencies in most workloads:
Sep 13 2021
After the Round 1 fixes, we ran another set of 10k requests with and without xhprof. Results can be found here: https://people.wikimedia.org/~jiji/benchmarks-round1-all/. We got mixed results; as a general pattern I will go out on a limb and say that at low concurrencies (c < 20) bare metal performs marginally better than or similar to Kubernetes, while at higher concurrencies (c > 20) Kubernetes performs better.
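For anyone who wants to eyeball that pattern themselves, a minimal sketch of diffing p50/p90/p99 per concurrency level between the two runs could look like the following; the CSV layout (concurrency,latency_ms) and file names are assumptions, not the actual format behind the link above:

```python
# Illustrative sketch: compare p50/p90/p99 latencies per concurrency level for
# two benchmark runs (e.g. bare metal vs Kubernetes). The CSV layout
# "concurrency,latency_ms" and the file names are assumptions.
import csv
import statistics
from collections import defaultdict

def percentiles_by_concurrency(path: str) -> dict[int, tuple[float, float, float]]:
    buckets: dict[int, list[float]] = defaultdict(list)
    with open(path) as fh:
        for row in csv.DictReader(fh):
            buckets[int(row["concurrency"])].append(float(row["latency_ms"]))
    out = {}
    for c, values in buckets.items():
        q = statistics.quantiles(values, n=100)
        out[c] = (q[49], q[89], q[98])  # p50, p90, p99
    return out

baremetal = percentiles_by_concurrency("baremetal.csv")    # hypothetical files
kubernetes = percentiles_by_concurrency("kubernetes.csv")
for c in sorted(baremetal):
    bm, k8s = baremetal[c], kubernetes[c]
    print(f"c={c:>3}  p50 {bm[0]:.1f} vs {k8s[0]:.1f}  "
          f"p90 {bm[1]:.1f} vs {k8s[1]:.1f}  p99 {bm[2]:.1f} vs {k8s[2]:.1f}")
```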
Sep 9 2021
@Cmjohnson You can now remove any of the remaining hosts at any given time, thank you!