
[DRAFT] Performance team goals, 2015-2016 Q2
Closed, ResolvedPublic

Description

Please feel free to edit and to link each item to the appropriate task, creating one if necessary.
  • Weekly performance reports with KPIs (Ori)
  • Multi-DC MediaWiki (tracking: T88445)
    • Swift replication solution
    • 100% readiness of PHP code
  • Solve all (5-6) thumbnailing problems on a VM.
    • Scaling in a service outside MediaWiki.
    • New hash-based URL scheme
    • Article purging on image update (not a goal in itself, but probably a side effect of the point below? It is a prerequisite for the following point.)
    • Far-future expires for thumbnails.
    • Hash-based purging.
    • To Swift or not to Swift, for thumbnail storage:

Probably only for reference-size thumbnails (incl. each page in multipage formats) and unusual formats that thumbor doesn't support yet.
Hm, so we'd keep the other ones in thumbor? If not, the storage (e.g. varnish) can be for both mw-scaled and thumbor-scaled thumbnails.
Yes, varnish would cache both the JPGs/PNGs generated by thumbor and the rest still coming from MediaWiki. Thumbor will have its own file-based cache in lieu of Swift for "safety" in case of a varnish disaster.
Ah, so thumbor would also store each thumbnail it generates.
Yes, with an expiry, unlike Swift.
(deterministic URL-hash-based host selection, à la varnish backends?)
Not sure how that works for varnish backends, but yes, we can be smart about how we pick thumbor servers, especially since we can have thumbor in the PoPs.
FYI, out-of-the-box thumbor does JPG and PNG well, which is 91% of files on Commons
Bonus: Thumbor supports WebP rendering. We could automatically serve WebP to browsers that support it. Might have community implications, though ("breaking" right-click + save).
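
The "deterministic URL-hash-based host selection" idea above could be sketched roughly as follows. This is only an illustration using rendezvous (highest-random-weight) hashing; it is not necessarily how varnish picks backends, and the function name, host names and URL are hypothetical.

```python
import hashlib


def pick_thumbor_host(thumbnail_url: str, hosts: list[str]) -> str:
    """Pick a host for a thumbnail URL using rendezvous hashing.

    Every (host, URL) pair gets a pseudo-random score; the host with the
    highest score wins. The same URL always maps to the same host, and
    removing a host only remaps the URLs that host was serving.
    """
    if not hosts:
        raise ValueError("at least one host is required")

    def score(host: str) -> int:
        digest = hashlib.sha256(f"{host}|{thumbnail_url}".encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as an integer score.
        return int.from_bytes(digest[:8], "big")

    return max(hosts, key=score)


# Hypothetical host names and URL, for illustration only.
hosts = ["thumbor1001", "thumbor1002", "thumbor1003"]
url = "/thumb/a/ab/Example.jpg/320px-Example.jpg"
chosen = pick_thumbor_host(url, hosts)
```

Because the choice depends only on the URL and the host list, repeat requests for a thumbnail would land on the server whose file-based cache already holds it.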

  • Slow parse log community tool + page performance debugging tool
  • Reading team performance process
    • Performance budget + synthetic testing (shared goal with Reading)
    • WebPageTest infrastructure
  • Programmatic pool / depool of application servers (blocker for RepoAuth) (Ori)
  • Continuous ResourceLoader caching (Timo; tracking: T102578)
  • MySQL monitoring - Anemometer (forwarding slow query log to central aggregator and deploying UI tool)
    • Use DBPerformance log instead (beef it up if needed, already logs slow reads/writes/transactions)
      • Kibana has various features we can utilise to provide better insight into the data and trends.
    • Questionable value given existing Kibana UI for queries from MediaWiki
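
For the thumbnailing goal above, here is a minimal sketch of how a hash-based URL scheme and far-future expires might fit together. It assumes the URL embeds a hash of the original file's content, so a re-upload yields a new URL and old URLs can be cached indefinitely. The URL layout, the hash length and both function names are hypothetical, not a decided scheme.

```python
import hashlib

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def thumbnail_url(original_bytes: bytes, name: str, width: int) -> str:
    """Build a thumbnail URL that changes whenever the original file changes."""
    # Hypothetical layout: a truncated content hash becomes part of the path.
    content_hash = hashlib.sha1(original_bytes).hexdigest()[:12]
    return f"/thumb/{content_hash}/{width}px-{name}"


def far_future_headers() -> dict[str, str]:
    """Caching headers for a URL whose content never changes."""
    return {"Cache-Control": f"public, max-age={ONE_YEAR_SECONDS}, immutable"}


old_url = thumbnail_url(b"first upload", "Example.jpg", 320)
new_url = thumbnail_url(b"second upload", "Example.jpg", 320)
```

Under this assumption, pages would still need purging on image update so that they pick up the new thumbnail URLs, which matches the prerequisite noted in the list.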

Event Timeline

ori raised the priority of this task to Medium.
ori updated the task description.
ori added a project: Performance-Team.
ori subscribed.

One of my hopes for the performance team is that with each quarter, we spend less time tightening the bolts on existing infrastructure, and more time in the hammock, doing forward-looking software design.

There is an opportunity to do some of that now in collaboration with the Services team. @GWicke is quite keen to experiment with Service Workers, because they could be an important piece of a future architecture, one in which Wikipedia content is represented as atoms of content which are rendered, combined, remixed, and adapted for different contexts, user experiences and use-cases, by a rich set of software components that talk with each other using open, standardized interfaces. Service Workers would potentially allow us to make this architecture seamlessly straddle clients and servers.

There are some short-term gains to be had by using Service Workers to serve skin and content separately, but if short-term gains were all that we were after, I would not be keen on utilizing Service Workers yet, since we know we have performance problems elsewhere that could be fixed using less exotic technologies and thus with substantially less risk of wasted effort.

But I do think that the fact that so much of our architecture is oriented around full pages as the basic unit of content is going to increasingly limit the relevance of Wikipedia content, and that in the absence of a forward-looking idea on how to tackle that, what we are likely to continue seeing is tremendous redundancy in our network traffic and a lot of unnecessary CPU cycles spent on prying apart content that should have been kept separate to begin with.

So I'm interested in us finding a way to participate in this work.

Krinkle set Security to None.
Krinkle updated the task description.
ori claimed this task.