
Unconference: how to restructure the current rest api
Closed, Resolved, Public

Description

Given that we want to remove the second caching layer that RESTBase creates, we need to think about how to change the individual services' access patterns.

This includes, but is not limited to:

  • mathoid
  • PCS/MCS
  • parsoid/php
  • wikifeeds (?)

Notes: https://etherpad.wikimedia.org/p/Wikimedia_Services

Event Timeline

Notes from discussion at TechConf2019

Doc: https://docs.google.com/document/d/1vfgMJBNIcbXDv4blyE6ni62kErPFUfpsXzuytvnG79o/edit#heading=h.7xfh8koom2yt

  • History:
    • RESTBase was invented as a support mechanism for Parsoid when it was first instantiated as a JS service
    • It provides a persistent cache with pre-generation
  • Initial doc created by Petr and Eric
    • Doc is structured as a set of recommendations with the intention of driving a roadmap of how to sunset RESTBase
  • Patterns used in RESTBase
    • Persistent caching - change propagation system that causes cache regeneration
      • Nothing has ever been measured in terms of whether or not that cache is needed
    • RESTRouter is a proxy
      • It exposes a public REST API, which did not previously exist in MW
      • For example, for the reading list extension it rearranges some URL parameters, which is no longer really necessary
      • With the MW REST API this behaviour becomes redundant
    • Composition
    • Access control
      • Can check whether certain pages were deleted; because the cache is persistent, we just flag that a page was deleted
      • There is ad hoc replication of MW into Cassandra; because Parsoid and RESTBase were outside MW, this let you avoid spinning up MW
        • Now that Parsoid is itself inside MW and we want to move its storage into MW, this becomes redundant
    • Title Normalisation
    • Some security header setup and management
  • Recommendations
    • Avoid unnecessary replication; e.g. the ad hoc replication of MW into Cassandra
    • Avoid premature optimization; e.g., storing responses for MCS -- no measuring to see whether we needed to do that
    • Tend toward Just in time Engineering rather than making assumptions about needs
    • Persistence
      • Parsoid is now in core, so many of its use cases for RESTBase are no longer necessary
      • Mathoid -- RESTBase stores the mapping from the hash of the LaTeX math input to the PNG output -- should move closer to where other media is stored
        • Service could move to managing its own storage
      • Routing
        • After we have removed the above-mentioned behaviours, we believe it is likely that the remaining requirements for RESTRouter will closely map to existing solutions
  • _Joe_ speaking about existing, off-the-shelf solutions
    • Storage, access control, RESTBase provides these for services
    • We want to have a very simple REST router that directs the request from the edge through to the service
      • The service would become responsible for managing its own access control, for example. This can be done via the MW REST API
    • For caching, a service should evaluate whether or not it needs caching and, if so, find some way to provide that cache
    • Rather than the behaviour being spread across multiple layers, we would move the responsibilities toward the service supported by MW and help determine what the service does actually need
    • API routing to route to services, and then let individual services find their own solutions
  • Q: What is the expected performance of AQS from a Product perspective?
  • A: Yep we know that
  • If we have that information we can make our decisions based on that; our premise is that we do not have that information for all or even most services
  • Cache invalidation is one of the hard problems in computer science, so we should consider whether or not we even need a cache
  • AQS has evaluated their service, their needs, tolerance and comfort zone so that they can accurately determine the caching needs
  • tl;dr: we should not let a layer that lives in front of a service dictate the functionality of that service
  • M: How many services are there that are based on the node service template?
  • P: Maybe like 20 -- it's 10s not 100s
  • M: I did this cache evaluation and it took about a week?
  • G: A lot of what we're caching we're caching at the edge
  • Timo: You mentioned that this is an antipattern, but isn't this what we do with Varnish?
  • G: Varnish is a response cache, you could argue RESTBase was both for internal services and acting like an edge cache.
  • Right now RESTBase doesn't allow you to cache common parts, for example, language split. Services should be able to define how their cache will behave so that it is optimised for the service's needs
  • Mate: When transitioning Wikia's microservice environment to K8s, we undertook a similar transition in that individual containerised services that previously delegated their authentication needs to an intermediate API gateway became fully responsible for their own authentication needs, while HTTP routing was taken over by k8s's ingress abstraction. Main point is that it does not have to be custom built; it can be a drop-in solution (k8s ingress, HTTP proxy etc)
  • Alex: 9 services that are installed and deployed currently use service-runner
  • Timo: You mentioned title normalisation, in what scenario should that actually be used? Surely any title you get should be already normalised, is that something that is used or is it something we might lose in the transition?
  • Petr: we normalize all titles in the rest api because we want to store it in the cache only once
  • G: The reason to normalize titles is to optimize caches; we do that in Varnish, whose normalization rules are different from those of RESTBase
  • James: RESTBase's normalization was wiki-specific
  • Timo: In what case would a non-normalized title be a valid request for an API?
  • Petr: we do have some things requesting non-normalized titles
  • Timo: if you don't know that a page exists yet, there is an API for that, but potentially we are hiding bugs by redirecting or "fixing" behaviour
  • James: You're essentially suggesting breaking the API; i.e., requesting "foo" vs "Foo" is now a 404 -- sorry Kiwix
  • Timo: is that doc public, will it be?
  • Petr: Next step will be to publish on wiki
  • G: We also need to update the external services doc detailing behaviour and expectation changes
  • Alex: what happens to changeprop?
  • Petr: It might disappear, it might not. Essentially the benefit of changeprop is that you can create static dependency loops; e.g., you store Parsoid HTML. If you only need to purge, then regeneration is less important
  • G: expanding on Petr (1) validation and pregeneration -- if rendering takes 1 minute and you want to respond quickly to users then pregeneration makes sense (2) propagating purges -- that's how we propagate purges -- even up to varnish. The JobQueue does it for MediaWiki jobs and manages the purges that MediaWiki controls. MCS takes things that come from a wiki and get cached at the edge. If we had purge-keys in the edge then we wouldn't need changeprop, but right now we do.
  • Timo: about changeprop, you mentioned that a lot of the metrics for it haven't been verified or justified. That's true, but there is one for VE; that is a complex problem, so there is likely justification there. However, for the VE use case I don't see a need for changeprop
  • Petr: that is where we started from -- i.e., to migrate parsoid storage into MediaWiki and then you don't need any of that
  • James: do we have outline performance metrics for parser cache?
  • G: we're still storing in RESTBase, not in parser cache. Pregeneration should be a mindful decision and not a default. Not caching is much less painful in general. If you need caching then you should create a strategy, but until you know you need caching you shouldn't cache. The lag in cache invalidation propagation is a problem, for example. tl;dr: it needs to be a mindful decision and not the default
  • Moritz: How can services manage their own cache? It seems difficult because they don't have the capability to decide. Why can't the service just say the output is deterministic and then be agnostic of the caching? As an outsider from the foundation it's hard to see if it's required or not.
  • P: there are several types of caching, and RESTBase did it all. For Mathoid, for example, we didn't do caching, we did storage.
  • Caching should be done at the edge and it will be transparent to you, simply the addition of a header to responses
  • P: for purging of varnish cache for example, RESTBase makes the decision to purge or not depending on the response from mathoid, which is not its responsibility
  • G: Varnish handles the caching but caching that is determined to be needed by the service should be specific
  • M: It's hard for someone with no access to data or stats to develop that upfront.
  • Timo: Edit stashing, particularly the Parsoid version; MW has its own as well. Consolidating object stashing seems approachable. For MW the parser cache is MySQL backed but its edit stash is actually in memcache, is that a problem?
  • Petr: we don't know. That is a problem. For that stash, there are 2 types, they both have the requirement that they should persist with a TTL
  • Timo: for MW, edit stashing is an optimisation for performance; if memcache is broken we just don't store. But for Parsoid there are some contractual expectations around persistent storage, so this might be another Kask situation?
  • G: or MySQL, why not use that? It works very well. A comment on this: whenever you make something that needs to be written and read from multiple senders, Cassandra may be a good candidate. If you are already going to use the database, why not store in the database?
  • Timo: MySQL is fast if you do simple things with it
  • Petr: I said we don't know a lot; we intentionally have not gone into implementation details, to avoid the trap of getting lost in them. We tried to make this non-controversial. Would anyone be very upset if we destroyed RESTBase?
  • Everyone: No one was immediately upset
  • Depends on how much work you make us do
  • G: Zero.
  • Dan: there's a non-zero amount of work; i.e., AQS has to change
    • It makes more sense for us to do the work ourselves
  • G: RESTBase will be reimplemented without the caching and storage; everything else should be exactly the same
  • D: is the swagger spec going to be exactly the same?
  • G: yes, it's a very general thing
  • G: so for those worried about the work, the first thing is to speak to your product manager, then determine the stats around what is needed for your service. You need to understand what the P95 at the edge is: is that in line with what we need? If yes, then you do not need additional caching/storage
  • Petr: There will be work, but it's not new work; it is work that was skipped originally. We leapfrogged the work that should have been done
  • Mate: MySQL vs Cassandra -- using MySQL via MW's DBAL can be a more portable solution, as it doesn't assume that everyone outside of WMF has Cassandra up and running in their own env
  • Moritz: I want to say something about the redesign for services. It would be good to see services as a lambda, but you should also save metadata around the services so that you can share back the information on what the expected response time is. It would be good to have a service template that people can use
  • Alex: RESTBase and the patterns that grew out of it prevent us from adopting SLIs (errors and latency) and SLOs, i.e., "I should never be slower than X", which is something that product managers
  • G: there is a tendency to say "Faster" but that is not acceptable, since you have to spend asymptotic amounts of money and time
  • Timo: Swagger: can you talk about what that UI would look like?
  • G: I don't think there is a clear understanding of how the REST API of MediaWiki is going to be implemented in this API gateway. In the RFC we talked about the REST API structure for MediaWiki itself vs what the API for the public will look like
  • P: There is an initiative for the CPT to build a doc portal -- that would encompass all the documentation
  • T: so that would encompass that RESTBase is a swagger front-end?
  • Will: developer portal -- for the REST API -- we have a tech writer on our team, so the portal should launch with the REST API info
  • T: changeprop -- there may not be a need for it anymore --
    • this sounds a lot like what refresh links does in MW
  • Could Daniel talk a bit about the changeprop tasks you have open?
  • Daniel: changeprop, no; but propagation of changes, yes
    • The idea is that as we are increasing the UI from being one thing to another, as we transform the data or extract things from the page and render separately for different clients, outputs etc
    • Each of these should have dependency tracking, each time we invent a new type of thing like this we have to also find a new way to track it
    • Part of the arch work of FAWG is how we compose things, and we have to think about caching and purging
  • G: so basically now changeprop is necessary for purging at the edge -- it's just not as automatic
  • D: every time an artifact is created, we should generate a URL that was used to create it. Building a system that tracks that is an interesting problem -- that's a really big graph to maintain
  • W: to what moritz was saying -- specifically for Mathoid -- on the dev/maintainers list CPT is the owner of mathoid. The SLOs and SLIs are on our team rather than the responsibility of the community even when the community has developed that service.
  • G: keep an eye on wikitech-l for more information about this
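
As an illustration of G's point above about checking the P95 at the edge before adding caching or storage, here is a minimal Python sketch. The function names, the nearest-rank percentile method and the idea of passing in a list of latency samples are assumptions made for illustration; none of them come from the discussion or from existing WMF tooling.

```python
import math


def percentile(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method.

    `samples` is any non-empty iterable of numbers (hypothetical edge
    latencies in milliseconds); `pct` is a percentage in the range (0, 100].
    """
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    # Nearest rank: the smallest rank such that at least pct percent of the
    # samples are less than or equal to the value at that rank.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def needs_extra_caching(edge_latencies_ms, target_p95_ms):
    """Return True only if the measured P95 misses the agreed target.

    `target_p95_ms` stands in for whatever number the service owner and
    the product manager agree on; it is an assumed input, not a known value.
    """
    return percentile(edge_latencies_ms, 95) > target_p95_ms
```

A service owner would feed this with latencies measured at the edge and the target agreed with their product manager. If `needs_extra_caching` returns `False`, the notes above suggest that no additional caching or storage layer is needed for that service.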

@Joe / @Cparle: Thank you for proposing and/or hosting this session. This open task only has the archived project tag Wikimedia-Technical-Conference-2019.
If there is nothing more to do in this very task, please change the task status to resolved via the Add Action...Change Status dropdown.
If there is more to do, then please either add appropriate non-archived project tags to this task (via the Add Action...Change Project Tags dropdown), or make sure that appropriate follow up tasks have been created and resolve this very task. Thank you for helping clean up!

Cparle claimed this task.