
RFC: Render data visualizations on the server
Open, Stalled, MediumPublic

Description

Motivation

Documenting the spread of COVID-19 has exposed difficulties in collaborating on data and rendering data visualizations. The latter problem is the motivation for this RFC; evidence can be found in T248707, T118783, and others. These front-end problems are related to the Graphoid service in a way that not everybody agrees is necessary, so I will lay out the need for both the Graph extension and Graphoid as I see it. This part of the RFC is more complicated and subjective, so discussion and updates are especially welcome.

To draw a graph we need data and a function that transforms it into visual objects. On top of that, we can layer interactivity so people can play with the new representation. Data is usually a few hundred KB of text, but it can get up to tens of MB or more if you're pulling in TopoJSON to render maps and getting multiple dimensions per geographic object. Here are some product research questions and my assumed answers until better data is available:

  • how much data will our larger datasets have? Up to 10MB
  • how many people choose to interact with a graph? Less than 1%
  • how many different representations of data do we need? 99% of graphs will probably be maps, bar graphs, line graphs, and (::shudder::) pie charts

To show why these guesses matter, let's assume a graph rendering a world map like the linked article above. And let's walk through rendering this client-side-only or with the aid of a service. Let's assume 1 million daily views.

  • Service pulls in 1MB of data, renders a 60KB image. 99% of clients get the 60KB image and 1% get 1MB of rendering code plus 1MB of data. Bandwidth to serve: 80 GB / client CPU time: 14 hours
  • Without a service, each client would pull in 2MB and wait 2-60 seconds depending on their processing power, let's say 5 seconds on average. Bandwidth to serve: 2 TB / client CPU time: 2 months
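The back-of-the-envelope numbers above can be reproduced directly. All inputs are the assumptions stated in the text, not measurements:

```javascript
// Sanity check of the bandwidth/CPU estimates above, using the
// assumed figures from the text (not measurements).
const views = 1e6;          // daily views
const imageKB = 60;         // rendered image size
const payloadKB = 2048;     // ~2MB of rendering code + data
const renderSec = 5;        // assumed average client render time
const interactRate = 0.01;  // ~1% of readers interact

// With a pre-rendering service: 99% get the image, 1% get the full payload.
const serviceGB =
  (views * (1 - interactRate) * imageKB + views * interactRate * payloadKB) / 1e6;
const serviceCpuHours = (views * interactRate * renderSec) / 3600;

// Client-side only: everyone gets the full payload and renders locally.
const clientTB = (views * payloadKB) / 1e9;
const clientCpuDays = (views * renderSec) / 86400;

console.log(serviceGB.toFixed(0), 'GB;', serviceCpuHours.toFixed(0), 'CPU hours');
console.log(clientTB.toFixed(1), 'TB;', clientCpuDays.toFixed(0), 'CPU days');
```

This confirms the roughly 80 GB vs 2 TB and 14 hours vs ~2 months gap claimed above.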

So even if just one semi-popular article had a single graph, pre-rendering seems necessary. Let's move on, then, to the existing solution, Graphoid. This service tried to do everything that Vega is capable of, including pulling in and transforming data from external APIs. For this reason, it cut through our infrastructure in ways that left some scars. Problems are documented in the stewardship request, T211881, and mentioned in other RFCs and tasks (T119043, T98940, etc.). I hope this RFC can be a place where we centralize these discussions. These are my takeaways, but do feel free to edit and add your own:

  • Stewardship request: T211881. The service integrates too deeply into page rendering. It created a situation where it has to know about the varnish caches, page property caches, parser caches, complicated timelines from when an edit happens to how it affects each cache, etc. And therefore folks working in those areas need to know about Graphoid as well. The next version should try to be as independent and functional as possible. To @akosiaris's point, this will also allow teams to work independently especially now that schedules are reduced. Also, another key point there is testability and performance monitoring. Graphoid calls out to mediawiki in a circular way so testing it in isolation can be tricky.
  • thoughts from @Bawolff: https://www.mediawiki.org/wiki/User:Bawolff/Reflections_on_graphs. My takeaway here is that we can and should support other rendering engines outside of Vega. Especially given the assumption about diversity of charts above.
  • initial motivation from @Yurik: https://meta.wikimedia.org/wiki/User:Yurik/I_Dream_of_Content. This vision was always great. The solution should not sacrifice the general idea that interactive and visual content are important.

The problems with Graphoid are:

  • supports old versions of Vega that require it to implement custom protocols
  • hard to upgrade to node 10, currently on node 6 (main reason it's being undeployed)
  • hard to maintain because of tight coupling with many and complex parts of our overall system
    • parsing: storing the graph definition in page properties violates what page_props was designed for, and complicates both parsing a page with a graph tag and Graphoid's job of fetching the graph definition
    • caching and invalidation: see patches trying to fix this in referenced issues
  • hard to test because of calls back to mediawiki API
  • noisy and annoying to SRE because of the HTTP errors from external data APIs

The requirements are:

  • easy to maintain, makes SRE happy
  • as separated from page rendering as possible
  • takes as many parameters as possible up front, to limit calls back to Mediawiki
  • updating when new data is available or graph is changed should not be made too complicated by the other simplifications
  • support rendering latest vega and latest vega-lite from the start (being able to upgrade or add a new renderer should be easy)

Affected Components: Graphoid and future Graph extension versions
Initial Implementation: @Milimetric (me) with @Pchelolo for code reviews and @akosiaris for "hey you're still doing it wrong"s (and the blessing of @Nuria, at least for now :))
Code Steward: @Milimetric until I find a team that enthusiastically wants to support this going forward

(NOTE: there are two types of maps: geographic and infographic. In this context, we're talking about infographic maps, where emphasis is on the data and not on accuracy of boundaries or landmarks. Basically, maps that Vega can render, as opposed to something like Google Maps; those are the domain of a different stack)

Existing plans and the Graph Extension

As of this writing, the Graphoid service is due to be undeployed. This proposal does not seek to block that work, indeed it seems like a good idea to undeploy the current service, make the graph extension capable of working client-side-only, and support it with a more stand-alone service going forward. The graph extension currently supports only Vega and the new service will be flexible. This is where we should keep in touch and do the product work to understand the needs of our communities so we offer the right features.

Indeed, the decision between the following proposals is partly about making the service easy to maintain, and partly about what we can support in the Graph extension. Basically, to render a static image of any graph (vega or not), as Gergo sums it up in T249419#6050917, we need three subsystems:

  1. fetch external dependencies (data and images like icons)
  2. render the graph
  3. store the image somewhere

I'm hearing consensus that we should store the image in Swift, ideally on a path that's a hash of the graph spec, so we essentially have Content-Addressable Storage. That sounds great to me, so let's focus on fetch, render, and control of the process.

I think we only have one real choice to make, and surprisingly it's what the graph spec looks like: either it's opaque to MW or not. Bryan's idea to write a CAS image URL into the page, decoupling MW parsing from graph rendering (T249419#6029620), is great, but the core of it can be applied in all scenarios. I'll use it below and also show it's orthogonal to cache invalidation. Like everything else, the way we handle cache invalidation really depends on whether MW can read the graph spec.

So, one proposal, with two options embedded:

Proposed Solution

  • MW (via graph extension) parses the page, sees a graph tag, reads the graph spec
    • [OPTION 1]: MW understands the spec, reads external dependencies
    • [OPTION 2]: the spec is opaque to MW
  • MW creates a content-addressable image url based on a hash of the spec, writes it to the output
  • MW queues a jobqueue job to render the graph at that image url and keeps parsing
    • [OPTION 1]: the jobqueue job records external dependencies in *links tables (or similar) to later invalidate pages cached with this graph tag. It then calls the service with data and spec
    • [OPTION 2]: the jobqueue job calls the service with only the spec
  • service renders the graph, either [OPTION 1] using the passed-in data or [OPTION 2] calling out for data itself
  • jobqueue stores the rendered image in Swift, at the image url that MW created
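Under OPTION 2, the jobqueue job could be as thin as this sketch. The `renderService` and `swift` clients are hypothetical stand-ins for whatever the real service-template-node and Swift interfaces end up being:

```javascript
// Sketch of the OPTION 2 jobqueue job: the spec is opaque to MW, so the
// job just forwards it and stores the result. `renderService` and
// `swift` are hypothetical client objects, not real APIs.
async function renderGraphJob(spec, imagePath, renderService, swift) {
  // Under OPTION 2 the service resolves any external data itself.
  const image = await renderService.render({ spec, format: 'png' });
  // Store at the content-addressable path MW already wrote into the page.
  await swift.put(imagePath, image);
  return imagePath;
}
```

The point of the sketch is how little MW-side logic OPTION 2 requires compared to OPTION 1's dependency bookkeeping.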

OPTION 1 means we wrap graph specs and declare statically what external resources the graph needs. These can be image URLs, data, etc. The biggest benefit of this is that we can easily let MW track dependencies for cache invalidation. The downside is that we can only support a subset of data visualization libraries like Vega.

OPTION 2 means the service needs to call out for the data. This means we have a loop of MW calling the service and the service calling back to MW. It's technically calling back just to get external resources. But one of those external resources could be data from a page that might have another graph on it, and I can see a malicious user being able to make a loop there. We could prevent this by restricting what the allowed external dependencies are, but I think a clear requirement here is coming up with a very good proof, to make SRE happy, that it's not a real loop. The benefit, of course, is that MW and the service can be much simpler and end users have the most powerful possible graph spec. The other downside is, of course, lack of dependency tracking. The service could implement its own, but that seems wasteful unless the dependency system is refactored.
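If we did restrict allowed external dependencies, even a coarse URL allowlist checked before the service fetches anything would rule out the wiki-page loop. The patterns below are purely illustrative, not a proposed policy:

```javascript
// Sketch: restrict which external data sources the render service may
// fetch, so it can never be pointed back at arbitrary wiki pages (which
// could themselves contain graphs). The allowlist entries are
// illustrative, not a proposed policy.
const ALLOWED = [
  /^https:\/\/commons\.wikimedia\.org\/.*\.(json|topojson)$/,
  /^https:\/\/wikimedia\.org\/api\/rest_v1\//,
];

function isAllowedDependency(url) {
  return ALLOWED.some((re) => re.test(url));
}
```

A static allowlist like this is also something we can show to SRE as part of the "it's not a real loop" proof.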

While going with OPTION 1 means committing to supporting our own wrapper spec, from the point of view of the service, the endpoints needed to support either option are very simple. The only architectural change would be to stop storing the graph spec in page_props. Besides that, I would clean up the code, upgrade dependencies, and add tests. So we could implement endpoints supporting both options and delay the choice until the graph extension is fully adopted by a team. This is what I'll be doing unless someone comments with strong objections.

Implications for the graph extension are minimal with OPTION 2. With OPTION 1, it would have to concede some flexibility and educate users on the new spec so they can write it. Or, of course, provide a nice editor that does it for them. Some thoughts on the graph extension in general:

  • UX of editors. Graph definitions should be easy to edit and validate client-side. Building templates on top of common types of graphs should be easy. MW templates don't seem like a great solution for this, and it's an open question for me whether editors will inevitably use MW templates for graphs or we can prevent it altogether.
  • It should be easy to copy and render old versions of graph definitions along with the data they referenced, if the APIs or URLs used allow it. This is where we may need to open up discussion about where we store these artifacts (MCR slots seem to make sense but I gather they're not ready for this use case).

Open Questions

  • Should we design a graph spec wrapper spec that can wrap any graph specs and statically expose external dependencies (OPTION 1 above)?
  • Where should we store the graph spec, given that it can be generated from templates? Should we disallow generating it from templates (can we even do that)?
  • Security considerations for private wikis (see task for previous Graphoid implementation: T90526)

Tagging other people that might be interested in this discussion:

  • @kaldari is working on a bot to update some of the most needed and critical graphs; that bot could use this service even before the graph extension does
  • @Tnegrin is project managing the immediate response to graphoid/graph extension problems
  • @Abit is leading a weekly sync
  • @Seddon who is working on making the graph extension client-side-only

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 4 2020, 3:12 PM
Milimetric updated the task description. · Apr 4 2020, 3:13 PM
Milimetric added a subscriber: Seddon.

Hi Dan -- thanks for this - great thinking. I'm going to loop in design on this. A couple of my thoughts:

  1. I think the discussion of maps needs to be made clearer. Maps were a big part of the Yuri vision, and he implemented a similarly troubled stack to serve them. You mention maps but I don't totally understand what this means and how it interacts with the maps stack.
  2. I don't want to get ahead of ourselves, but I wonder if there's an existing project that comprises a lot of the functionality we're looking for. An example would be RAWGraphs (I have no connection to or knowledge of this project; it was the top search result and looks reasonable). I also think that the provisioning of the data should be outside the scope of this project, if that's not already clear.

-Toby

kostajh added a subscriber: kostajh. · Apr 4 2020, 6:28 PM

I think the most complex (and underspecified) aspect here is data invalidation. Which, to be fair, the current solution handles by basically pretending it's not a problem. However, graphs can include all sorts of additional resources. These resources aren't even recorded, and there is no cache invalidation when they change. What we do here is probably going to depend a lot on whether we continue to pretend this problem doesn't exist, or try to address it.

If we're pretending the data invalidation problem doesn't exist, a very simple architecture might be:

  • MediaWiki takes a SHA sum of the graph spec
  • Puts an <img src="graphoid/AB324334....32.png"> into the page. Also puts the graph spec somewhere in the page. Maybe a data attribute, maybe a <script> tag at the end or something.
  • Varnish may have that in cache, in which case it responds. Otherwise, it responds with a non-image 404. [My assumption here is that there is no filesystem image save, just a big varnish cache]
  • JS has an onerror handler for the <img> tag
  • If the error handler is triggered, then client-side JS sends a POST request to graphoid with the graph spec [graphoid is responsible for fetching all dependent data. The assumption is that, if data is really heavy, it's not the original graph spec that is heavy, but other data]
  • If graphoid returns an image, magic causes the image to be cached at the varnish layer for future GET requests
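The onerror fallback described above might look roughly like this; the endpoint path and data attribute name are illustrative, not anything that exists today:

```javascript
// Illustrative client-side fallback: if the cached image 404s, POST the
// embedded spec to the service and swap in the freshly rendered result.
// The endpoint path and `data-graph-spec` attribute are assumptions.
function attachGraphFallback(img) {
  img.onerror = async () => {
    img.onerror = null; // avoid retry loops if the render also fails
    const spec = img.dataset.graphSpec; // spec embedded in the page
    const resp = await fetch('/api/graphoid/render', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: spec,
    });
    if (resp.ok) {
      img.src = URL.createObjectURL(await resp.blob());
    }
  };
}
```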

This would have the nice properties that: The mediawiki backend does not have to know anything about graphoid, and it works entirely independently from it. No state about the graph is stored anywhere in mediawiki. All state on the graphoid/varnish side can be regenerated on the fly, if need be.

Downside: there is no cache invalidation at all. It requires putting the graph spec into the page. I assume that graph specs (as opposed to TopoJSON files) are fairly small when compressed, but I don't really know. It also requires graphoid to make network requests to a bunch of arbitrary services.

I've gone through a first reading of this. Thanks for so nicely summarizing the intent behind the original service as well as the problems with the current implementation.

My first round of comments is going to be a bit meta and less technical, stay tuned for the technical part.

A couple of notes:

  • lots of errors connecting to external APIs for data, this makes SRE hard (section "The problems with Graphoid are:")

I think this sentence is not complete

  • Regarding the requirements, i.e.
The requirements are:

as isolated from page rendering as possible
takes all required parameters up front so as to not have to make calls to Mediawiki
takes all data up front so as to not need external APIs
updating when new data is available or graph is changed should not be made too complicated by the other simplifications
flexible as to the graph rendering engine. Anything that can render on nodejs should be allowed, not just Vega.

It probably makes sense to split them into hard and soft ones, that is, MUST haves and NICE to haves. E.g. we might want to loosen the graph rendering engine requirement for the first implementation, targeting only Vega (or dare I say a specific Vega version?) at the beginning (by not running tests against other implementations?)

  • Many thanks for the following
As of this writing, the Graphoid service is due to be undeployed. This proposal does not seek to block that work, indeed it seems like a good idea to undeploy the current service, make the graph extension capable of working client-side-only, and support it with a more stand-alone service going forward

Decoupling the new service from the current one is great.

  • I find the point about the Graph extension and the new graphoid service needing to work closely together, at least later on, absolutely true.
  • Finally, even though it's probably far in the future, it probably makes sense to try and build some kind of nice UX for users creating/manipulating graphs. For example, disallowing submitting invalid graph specs and informing users of the errors, or providing a preview of the generated image in Visual Editor before the save.

Now on to the technical parts:

I really like this part

Input =>
* Graph spec. For vega, this would be the json definition, and it would include the data so there is some work here to deal with specs that include URIs
* Fully realized data, including artifacts like TopoJSON, country codes, etc
* Rendering Engine. For example: vega1, vega5, vega-lite1, or bar-chart (the implications of that last option are a big topic so initially just vega*)
* Rendered Image options. Size, format, etc.
Output <= rendered graph image

This approach makes it easy to test/debug/develop the service in isolation, as the entirety of the input is in the control of the client, as opposed to instructing the service to load data from arbitrary sources. It's also inherently safer, as we can much more easily sanitize input coming from a single source vs. input included from multiple sources.
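For illustration, such a fully self-contained request might look like this. The field names are my guess at what the eventual API would call them, not a decided interface:

```javascript
// Illustrative shape of a fully self-contained render request; field
// names are assumptions, not the service's actual API.
const renderRequest = {
  engine: 'vega5',  // which renderer: vega1, vega5, vega-lite1, ...
  spec: {},         // the graph definition, with data URIs already resolved away
  data: { cases: [{ country: 'X', count: 42 }] }, // fully realized data
  image: { format: 'png', width: 600, height: 400 },
};

// The service could cheaply assert up front that the request really is
// self-contained, i.e. the spec carries no external "url" references.
// (A real check would walk the spec properly; this is a crude sketch.)
function isSelfContained(req) {
  return !JSON.stringify(req.spec).includes('"url"');
}

console.log(isSelfContained(renderRequest)); // prints true
```

Rejecting non-self-contained requests at the door is what keeps the single-source sanitization property above honest.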

An interesting caveat however is that this is a request that might take some time to serve as it might be CPU intensive (or the service might be under stress). There are 2 cases I see where this might be interesting.

  • Direct responses to editors for preview purposes. This means we will expose the service publicly (that's fine) and editors will expect a refresh of the preview every now and then (how often? on change, on demand, or scheduled? I have no answer). Since the client is the browser in this case, we control the timeout and can wait for some time for the response to come back. I expect the editors simultaneously messing with various graphs to be a rather small subset of our users, so that should be fine. We already hold open connections for other services in our infrastructure anyway.
  • Responses to mediawiki. I expect these to happen on page save (at least?), where mediawiki will POST the graph spec to the service and expect back an image. This is more tricky. PHP does not have async functionality, and blocking a worker thread waiting for graphoid to answer is not advisable (it might very well lead to a widespread outage, in fact, as stampede effects might come into play). There are a number of ways out of this, the most notable and easiest to work with being the jobqueue. This should allow us to spawn a job that can wait for the service to return the image without endangering the user-serving infrastructure.

By the way, I am guessing in all of this, the input is a standard HTTP POST query, right?

Now on what happens with the generated image. You say that an approach would be to save the resulting image URL with the page. It's unclear if you mean in the database (which is probably a bad idea) or not. It would be best to clarify.

But my proposal would be to use Swift for this. It's the place where we save all images anyway, and we can easily have the jobqueue do that. It will mean the Graph extension will have to put in the parsed page a way to refer to it; the idea from @Bawolff about using a hash of the graph or image would be my preferred approach currently. This is essentially content-addressable storage[1] and has the nice benefit that an update to the graph spec will change its address as well, avoiding the need to invalidate the generated images. It does, however, pose the question of whether we should eventually implement garbage collection (we might never need to, but who knows?)

I agree with the proposal about wanting to save the original graph spec. If we don't, most editors will have a pretty bad experience (especially the ones trying to edit a graph from someone else). Whether an MCR slot is the best place for this original graph spec is up for discussion I guess, but to me it does make some sense.

As far as UX goes, specifically the UX of editors ("Graph definitions should be easy to edit and validate client-side"): feel free to expose the new service to the world and use it for the client-side validation, at least in the beginning. While there will be some back and forth of data, and it could be burdensome for some editors (really complex, big graphs will take a long time to be POSTed), it might be a good compromise at the beginning. The other benefit is that it keeps the "business logic" in a single place, and as such there can be no discrepancies between e.g. the graph engine versions running on the clients and the service.

I hope this helps. I am open for clarifications/questions.

[1] https://en.wikipedia.org/wiki/Content-addressable_storage

I think the most complex (and underspecified) aspect here is data invalidation. Which, to be fair, the current solution handles by basically pretending it's not a problem. However, graphs can include all sorts of additional resources. These resources aren't even recorded, and there is no cache invalidation when they change. What we do here is probably going to depend a lot on whether we continue to pretend this problem doesn't exist, or try to address it.

If we're pretending the data invalidation problem doesn't exist, a very simple architecture might be:

  • MediaWiki takes a SHA sum of the graph spec
  • Puts an <img src="graphoid/AB324334....32.png"> into the page. Also puts the graph spec somewhere in the page. Maybe a data attribute, maybe a <script> tag at the end or something.
  • Varnish may have that in cache, in which case it responds. Otherwise, it responds with a non-image 404. [My assumption here is that there is no filesystem image save, just a big varnish cache]
  • JS has an onerror handler for the <img> tag
  • If the error handler is triggered, then client-side JS sends a POST request to graphoid with the graph spec [graphoid is responsible for fetching all dependent data. The assumption is that, if data is really heavy, it's not the original graph spec that is heavy, but other data]
  • If graphoid returns an image, magic causes the image to be cached at the varnish layer for future GET requests

What happens though if the image has been purged from varnish (for whatever reason)? We need to store it somewhere if we want to be able to serve it and not have the users redo the entire POST dance. Hence my Swift proposal above. Which begs the question of who stores it in Swift, mediawiki or graphoid? mediawiki already has this functionality; reimplementing it in graphoid may be a waste of resources.

This would have the nice properties that: The mediawiki backend does not have to know anything about graphoid, and it works entirely independently from it. No state about the graph is stored anywhere in mediawiki.

All state on the graphoid/varnish side can be regenerated on the fly, if need be.

That's not true, unfortunately. If the service needs to make network requests to a bunch of arbitrary services to populate the graph with data, there's a chance that data will differ between calls to them (e.g. covid-19 data changes every day; at times even previously posted stats are revised with newer data), meaning that state changes between graph renders. It's the content-addressable storage idea above that saves us from having to invalidate varnish, but that does not change the fact that 2 different calls to graphoid to render the exact same graph spec might return different images (and we might very well be unable to ever render a previous state, which might be something we want).

Downside: there is no cache invalidation at all. It requires putting the graph spec into the page. I assume that graph specs (as opposed to TopoJSON files) are fairly small when compressed, but I don't really know. It also requires graphoid to make network requests to a bunch of arbitrary services.

Having the client (I am assuming from the "client side JS" description above we are talking about a browser) fetch a potentially large graph spec, ship it over the internet and then back to our infrastructure feels like a waste of resources (not just ours), not to mention a pretty good increase in latency. What's exacerbating it is that we are sending the graph spec to all users, regardless of whether they are editing or not, since it's always in the page. Pages with large graph specs (IIRC it's possible to have the data in the graph spec, so we can't rule it out) might very easily become unwieldy on low-powered devices and consume a lot of data, which might be important on metered connections. Finally, having graphoid (or any service for that matter) fetch all the dependent data makes it more difficult to monitor, debug, test and benchmark.


Another thing that just crossed my mind is that this makes it a prerequisite that we expose the service to the public (this might happen anyway for other reasons, e.g. using the service as a graph spec validation entrypoint), which increases the attack surface. Depending on the chosen architecture, this could be problematic, e.g. if the sanitization layer isn't built into the service itself (for whatever reason). Parsing arbitrary user input without adequate sanitization is a pretty sure way to get compromised. This opens up the discussion of how graph spec input is currently sanitized and whether the approach is still applicable in the new service or not (it should also influence the decision of whether to expose the service publicly).

Milimetric updated the task description. · Apr 8 2020, 7:35 PM

Thanks all for the comments so far. In response to some questions, I've updated the description to clarify. Direct responses to the rest:

  1. I don't want to get ahead of ourselves, but I wonder if there's an existing project that comprises a lot of the functionality we're looking for. An example would be RAWGraphs (I have no connection to or knowledge of this project; it was the top search result and looks reasonable). I also think that the provisioning of the data should be outside the scope of this project, if that's not already clear.

This particular one is not a server-side renderer, it's more of a rich client. I think this type of work should definitely be incorporated with the Graph extension and Visual Editor along the lines of what Alex mentions in T249419#6032988. If we agree that having a service is a necessity, as I argue, then it should be written to our standards (service-template-node) which is what I'm proposing here. And to be clear, the actual code that renders graphs is totally third party, the service will just connect inputs to that rendering and do something useful with the output. So I think custom code is minimal but necessary for our SRE purposes.


  • Direct responses to editors for preview purposes[...]

Clarified description, but during preview the graph extension should exclusively use client-side rendering.

  • Responses to mediawiki. I expect these to happen on page save (at least?), where mediawiki will POST the graph spec to the service and expect back an image. This is more tricky. PHP does not have async functionality, and blocking a worker thread waiting for graphoid to answer is not advisable (it might very well lead to a widespread outage, in fact, as stampede effects might come into play). There are a number of ways out of this, the most notable and easiest to work with being the jobqueue. This should allow us to spawn a job that can wait for the service to return the image without endangering the user-serving infrastructure.

Yes, this is where Mathoid seems to have figured out several solutions. One is to return a URL right away and work on saving in the background. It gets complicated, but let's see what the performance is like before we optimize too early. With all the data available, the response might be quick enough.

Now on what happens with the generated image [...] But my proposal would be to use swift for this [...] This is essentially content addressable storage[1] and has the nice benefit that an update to the graph spec will change it's address as well, avoiding needing to invalidate the generated images. It does however pause a question of whether we should implement eventually garbage collection (we might never need to, but who knows?)

Yes, Swift is what Petr and I talked about as well; I kind of punted on this until I better understand Mathoid's strategy and how it applies here. Garbage collection, definitely, but to do that we need to store the input for the graph somewhere, which is one of the bigger unsolved problems. (though lots of solutions seem to exist)

As far as UX goes, specifically the UX of editors ("Graph definitions should be easy to edit and validate client-side"): feel free to expose the new service to the world and use it for the client-side validation, at least in the beginning. While there will be some back and forth of data, and it could be burdensome for some editors (really complex, big graphs will take a long time to be POSTed), it might be a good compromise at the beginning. The other benefit is that it keeps the "business logic" in a single place, and as such there can be no discrepancies between e.g. the graph engine versions running on the clients and the service.

I think I disagree here; getting instant feedback and validation client-side is much nicer. You can even incorporate it in the editor; Vega and other tools have schemas that can be checked as you type. That's a much better UX than submitting and waiting, or doing something fancier behind the scenes. I take your point that the service would be fine, especially early on, but this really is simpler to solve client-side from the start. And good point on the client and service keeping in sync with Vega versions; that really should be the main thing we clean up with the new implementation. This is the primary reason I insisted on supporting both vega and vega-lite from the start: to make sure that different rendering engines are supported and easily maintained in the codebase. In a way, that's what this new implementation should optimize for.


I think the most complex (and underspecified) aspect here, is data invalidation.

Indeed. And I intend to either handle it or make it obvious that it's not handled. I think pretending it doesn't exist leads to lots of jumping through hoops, which is more complicated, as Alex shows in his response to your solution (T249419#6033180). So, in my opinion, we need two things:

  • to find a good place to store the graph spec, in something like an MCR slot. That way if it changes, including the data it links to, the page is invalidated. (MCR is not required, but invalidation is, this would normally be handled by a *links table)
  • a new mechanism to invalidate, namely Time. We need to be able to say: the data this graph links to is not valid after 1 day. And then if we have trouble refreshing it, the graph extension can say "Invalid for 23 days" or "Refreshed 14 hours ago"
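A hypothetical sketch of that time-based mechanism: given when the data was last refreshed and a per-graph maximum age, produce the status string the Graph extension might show. Names and message wording are invented, and pluralization is ignored.

```javascript
// Hypothetical sketch of time-based invalidation messages for a graph.
// lastRefreshedMs / maxAgeMs / nowMs are millisecond timestamps/durations.
const HOUR = 3600 * 1000;
const DAY = 24 * HOUR;

function freshnessLabel(lastRefreshedMs, maxAgeMs, nowMs) {
  const age = nowMs - lastRefreshedMs;
  if (age > maxAgeMs) {
    // Data is past its allowed age, e.g. refresh attempts keep failing.
    const staleDays = Math.floor((age - maxAgeMs) / DAY);
    return `Invalid for ${staleDays} days`;
  }
  const hours = Math.floor(age / HOUR);
  return `Refreshed ${hours} hours ago`;
}
```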

[...] Parsing arbitrary user input without adequate sanitization is a pretty sure way to get compromised. [...] (It should also influence the decision of whether or not to expose the service publicly.)

Vega and Vega-Lite have clearly defined specs for all versions. Any other rendering engine we use should have the same. I will require validating all user input, not using unprotected regexes on it, etc. You're right, this should be in the bylaws of any service :)

ovasileva moved this task from Incoming to Product Doing on the covid-19 board. Apr 9 2020, 5:11 PM

Heads up - this RFC will be discussed in the #wikimedia-office IRC channel on Wednesday, April 15, at 21:00 UTC. Please do continue the discussion here, and feel free to reach out to me on IRC before then. The meeting will be a nice way to broadcast this and attract more opinions, but there's no reason to wait until then.

Yes, Swift was what Petr and I talked about as well; I kind of punted on this until I better understand Mathoid's strategy and how it applies here. Garbage collection, definitely, but to do that we need to store the input for the graph somewhere, which is one of the bigger unsolved problems (though lots of solutions seem to exist).

As far as UX goes, specifically the UX for editors: graph definitions should be easy to edit and validate client-side. Feel free to expose the new service to the world and use it for client-side validation, at least for the beginning. While there will be some back and forth of data and it could be burdensome for some editors (really complex, big graphs will take a long time to POST), it might be a good compromise at the beginning. The other benefit is that it keeps the "business logic" in a single place, and as such there can be no discrepancies between e.g. the graph engine versions running on the clients and the service.

I think I disagree here: getting instant feedback and validation client-side is much nicer. You can even incorporate it in the editor; Vega and other tools have schemas that can be checked as you type. That's a much better UX than submitting and waiting or doing something fancier behind the scenes. I take your point that the service would be fine, especially early on.

That was my point exactly. In the long term, I think it would be preferable that the service would NOT be exposed to end-users (maintaining a publicly exposed API is way more arduous than an internal one), but if it would be beneficial for that to happen during the early phases of deployment we could do that.

but this really is simpler to solve client-side from the start.

And that's nice to hear!

I think the most complex (and underspecified) aspect here is data invalidation.

Indeed. And I intend to either handle it or make it obvious that it's not handled. I think pretending it doesn't exist leads to lots of jumping through hoops, which gets more complicated, as Alex shows in his response to your solution (T249419#6033180). So, in my opinion, we need two things:

  • to find a good place to store the graph spec, in something like an MCR slot. That way if it changes, including the data it links to, the page is invalidated. (MCR is not required, but invalidation is, this would normally be handled by a *links table)
  • a new mechanism to invalidate, namely Time. We need to be able to say: the data this graph links to is not valid after 1 day. And then if we have trouble refreshing it, the graph extension can say "Invalid for 23 days" or "Refreshed 14 hours ago"

At the risk of sounding pedantic, this RFC is about the service that will render graphs to images, not about how the graph spec and/or data will be stored on the MediaWiki (or any other, for that matter) side. That is a pretty interesting discussion and definitely needs to be held, just not in this RFC, mostly because it's a pretty big one; we have an RFC for that already, as pointed out in the description.

As far as the service goes, up to now the proposal describes a fully stateless service which receives with each request the entirety of the data it requires to render the image, so as far as the service itself goes, we don't really have a cache to invalidate. We should, however, discuss (and in fact we already have) cache invalidation of the resulting images. Let me reiterate that I like the content-addressable storage idea that @Bawolff described above, as that makes cache invalidation in the other parts of the infrastructure (especially edge caches) way less of a problem. Unless I forget something, by using that approach we limit the cache-invalidation needs to bugs in the service returning invalid images, and that's assuming the content-addressable part happens on the input side and not on the image. To make this clearer, let me lay out an example:

  1. Let's say the service has a bug that, in some corner case, renders a corrupt image.
  2. We have a graph (spec+data) that triggers the above
  3. The service renders the image and it's stored in swift with a file name of sha256(normalized(spec+data)).jpg.
  4. We deploy the fix for the bug above.
  5. We need to trigger a rerender of the affected images to invalidate them.

Which is already such a corner case (we did, however, meet a few of those with Mathoid) that it just might make sense to handle it manually (perhaps via some batch script, or alternatively just by relying on the edit rate of pages).

Alternatively, instead of storing the image as sha256(normalized(spec+data)).jpg we can store it as sha256(image.jpg).jpg, that is, we checksum the result itself. But that makes the assumption that transforming the exact same graph spec and data always leads to the exact same, bit-for-bit, image. Which is an assumption I don't know if we can make. My guess is not.

TechCom is hosting a meeting to discuss this on April 15th at 21:00 UTC in the #wikimedia-office IRC channel.

Tgr added a subscriber: Tgr. Apr 13 2020, 1:19 AM

So basically we have three subsystems here (which might be all in the same service, or distributed between multiple things):

  • Rendering: a lambda service that takes the graph definition and all dependencies (Commons data tables, Wikidata statements, interface message translations, possibly some kind of graph modules) and renders an image. This is stateless, probably deterministic, probably implemented as a thin wrapper around some third-party visualisation library (initially, Vega) that can also be run on the client side (since the image will have to be replaced with an identical JS canvas when the user starts interacting with it).
  • Fetching: retrieve all the dependencies that rendering depends on.
  • Caching: store the rendered image and serve it on subsequent pageviews. Invalidate it if the graph definition changes; account for invalidation of the dependencies. (For resources local to that wiki, or resources which already have an event propagation system, such as Wikidata, this will probably mean properly invalidating when there's a change event. For external resources, possibly involving resources from another wiki, it would probably have to be handled via expiry times instead.) On cache miss, probably serve some kind of "click here to see the graph" image that's just an entry point to the client-side renderer.

Architectural constraints / assumptions:

  • Can we assume that all data needed for the interactive experience can be fetched beforehand? (A hypothetical counterexample would be a map snippet that can be zoomed into; you can easily fetch enough data to render a static image, but all the data that might be required depending on where the user zooms would be way too much to handle.) If we cannot then the client-side rendering code will have to issue API calls, which brings up questions about how the server-side service prevents or handles those calls.
  • Can we assume that all data needed can be shipped to the client on a normal pageview? (Almost certainly not, it would have to be shipped on interaction with the graph only.)
  • Rendering speed and throughput: can we assume the rendering is fast enough to be called synchronously during page parsing? (Almost certainly not.)
  • Fetching speed and throughput: can we assume that fetching the data is fast enough to be called synchronously during page parsing?
  • Cache invalidation frequency: is it OK to invalidate the whole wiki page every time the graph data (but not the definition) changes; that is, to handle graph invalidation by changing the image source URL? That will result in re-parsing the page, which is potentially a lot more expensive than the graph rendering itself; and the data might change frequently.

Implementation options:

  • Rendering is straightforward, especially if we assume that the rendering logic cannot call out for data (looking at current use cases, this seems like a reasonable limitation). A node service using the usual conventions, wrapping mostly-third-party graph rendering logic.
  • If fetching can be done synchronously, then caching is most easily done via content-addressable storage. The most straightforward way to do this is to just get the graph definition plus all the data during page rendering, generate a hash, send the def+data to the graph renderer, store the resulting image under that hash. (This is how image rendering works on vanilla MediaWiki.) But that means transient graph rendering errors cannot be easily recovered from (one would have to re-render the page). Also, any change to the data would require re-rendering the page.
    • Alternatively, generate the hash, store the fetched data under that hash somewhere (internal cache? MCR field? custom DB table?), and then on a cache miss for the image retrieve the data and do the rendering. (This is how image rendering works in Wikimedia production, with Swift falling back to the Apache 404 handler which is wired into the MediaWiki thumb renderer.) On errors, just try the whole thing again on the next image request (maybe use Varnish magic to cache a short-lived error image). Data changes would still require re-rendering the page.
    • If fetching data cannot be done synchronously, or data changes are frequent enough that always re-rendering the page is unacceptable, then instead of a content-based hash we'd have to use some unique token that gets associated with the pending job for data retrieval & rendering; when those finish, the results get stored under the token. This will be a many-to-one relationship (since edits to the page will generate new tokens, even if the graph definition + data did not change), presumably to be handled with redirects.
    • As to the storage mechanism, either the caching subsystem could be its own node service (acting as a proxy between the renderer and MediaWiki) with its own image storage, or the images could be stored in Swift, which would forward the request to the service on a miss (like it's done with Thumbor).
  • The fetching subsystem could be the same service as the caching one, or it could live in a MediaWiki job. Initially it might be the easiest to skip it entirely and just force users to transclude the entire dataset into the graph definition.
TheDJ added a subscriber: TheDJ. Apr 13 2020, 8:30 PM
CDanis added a subscriber: CDanis. Apr 15 2020, 1:58 PM

So basically we have three subsystems here (which might be all in the same service, or distributed between multiple things):

(such a good summary)

Architectural constraints / assumptions:

  • Can we assume that all data needed for the interactive experience can be fetched beforehand?

Not for the interactive experience, Vega can vary what it queries for based on how users interact with the visualization. But there's always an initial state of the graph, so we can render that as a static placeholder. This seems especially important for complex visualizations that need lots of data.

  • Can we assume that all data needed can be shipped to the client on a normal pageview?

This was what I tried to get at with my assumptions in the Motivation section. If those hold true, then it seems incredibly wasteful to ship data and graph specs to each client by default.

  • Rendering speed and throughput: can we assume the rendering is fast enough to be called synchronously during page parsing?

No; even if most graphs are quick to render, it's simple to create an outlier, even with the external data needed for rendering already available in memory.

  • Fetching speed and throughput: can we assume that fetching the data is fast enough to be called synchronously during page parsing?

No, fetching could involve arbitrarily complicated Wikidata queries.

  • Cache invalidation frequency: is it OK to invalidate the whole wiki page every time the graph data (but not the definition) changes; that is, to handle graph invalidation by changing the image source URL? That will result in re-parsing the page, which is potentially a lot more expensive than the graph rendering itself; and the data might change frequently.

Ideally not, but could we not decouple the freshness of the image from the freshness of the page, the same way that commons images are used?

Implementation options:

  • Rendering is straightforward, especially if we assume that the rendering logic cannot call out for data (looking at current use cases, this seems like a reasonable limitation). A node service using the usual conventions, wrapping mostly-third-party graph rendering logic.

Yep, I have a basic proof of concept up; it took a couple of hours to write. The main complexity is in how and where to POST to such a simplified service, and there are a few problems, some of which you found below:

  • If fetching can be done synchronously, then caching is most easily done via content-addressable storage. The most straightforward way to do this is to just get the graph definition plus all the data during page rendering, generate a hash, send the def+data to the graph renderer, store the resulting image under that hash. (This is how image rendering works on vanilla MediaWiki.) But that means transient graph rendering errors cannot be easily recovered from (one would have to re-render the page). Also, any change to the data would require re-rendering the page.
    • Alternatively, generate the hash, store the fetched data under that hash somewhere (internal cache? MCR field? custom DB table?), and then on a cache miss for the image retrieve the data and do the rendering. (This is how image rendering works in Wikimedia production, with Swift falling back to the Apache 404 handler which is wired into the MediaWiki thumb renderer.) On errors, just try the whole thing again on the next image request (maybe use Varnish magic to cache a short-lived error image). Data changes would still require re-rendering the page.
    • If fetching data cannot be done synchronously, or data changes are frequent enough that always re-rendering the page is unacceptable, then instead of a content-based hash we'd have to use some unique token that gets associated with the pending job for data retrieval & rendering; when those finish, the results get stored under the token. This will be a many-to-one relationship (since edits to the page will generate new tokens, even if the graph definition + data did not change), presumably to be handled with redirects.
    • As to the storage mechanism, either the caching subsystem could be its own node service (acting as a proxy between the renderer and MediaWiki) with its own image storage, or the images could be stored in Swift, which would forward the request to the service on a miss (like it's done with Thumbor).
  • The fetching subsystem could be the same service as the caching one, or it could live in a MediaWiki job. Initially it might be the easiest to skip it entirely and just force users to transclude the entire dataset into the graph definition.

The idea I'm exploring right now is:

  • While editing, if a user agent changes a graph spec, that user agent has to be able to simulate rendering the graph and collect all the data it would fetch (Vega can do this, so any agent that can run JS can do it too)
  • Before the save is complete, POST to graphoid (bikeshed the name later) the graph spec and all external resources it needs (this includes icons!, results of Wikidata queries, etc.)
  • graphing lambda service responds async with an image path (something in Swift)
  • The image path gets saved as part of the wikitext of the page
  • The graph spec gets saved in a new MCR slot

That first assumption is the part I know the least about and what I intend to ask in today's IRC discussion. If we can't assume that the editing agent can do that, then we need a secondary fetching service as you describe, and someplace to store the data, so at that point we should think more carefully about this requirement, and maybe just let graphoid make external calls but make it more robust.
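For illustration, the POST body in the second step of that flow might be shaped something like this. All field names here are hypothetical, not a settled API; the point is only that the spec and every external resource it references travel together in one request.

```json
{
  "spec": {
    "$schema": "https://vega.github.io/schema/vega/v5.json",
    "...": "..."
  },
  "resources": {
    "tabular:NYC weather data.tab": { "...": "..." },
    "image:Icons-flag-us.png": "<base64-encoded bytes>",
    "wdqs:some-query-hash": { "...": "..." }
  }
}
```

Since the service receives everything it needs in the request, it can stay fully stateless, which is the property the proposal is after.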

Looking forward to discussing this in a couple of hours and onward, until we find a decent solution.

Tgr added a comment. Apr 15 2020, 8:14 PM

Ideally not, but could we not decouple the freshness of the image from the freshness of the page, the same way that commons images are used?

We could. That would mean not using content-addressable URLs and the service handling its own cache invalidation instead of relying on the page itself getting invalidated. That's on one hand more flexible (it could deal with external APIs that have their own invalidation signals, for example), on the other hand a lot more complex (MediaWiki already has a bunch of invalidation logic, so with page-level invalidation we'd just have to add all the dependencies of the graph to the link tables and it would just work).
FWIW images are planned to use content-based URLs at some point (T149847: RFC: Use content hash based image / thumb URLs), mainly I think to allow them to be cached indefinitely on the client.

  • While editing, if a user agent changes a graph spec, that user agent has to be able to simulate rendering the graph and collect all the data it would fetch (Vega can do this, so any agent that can run JS can do it too)

Ideally a preview would result in the same rendered (HTML) content as the final page, so that no re-rendering is needed and preemptive parsing can be used. That would mean invoking Vega the same way it's invoked on a normal page (which might be a UX issue depending on how long Vega takes). Breaking preemptive parsing means worse save performance; I'm not sure how big a deal that would be.

Changing a graph in VisualEditor is a different issue; that's not a preview as far as MediaWiki sees it. VisualEditor might invoke Parsoid to render what's basically a preview, but only when it cannot figure out something on its own (like when a new template is added). For a graph edit that might be avoidable, although it seems nontrivial (you'd have to be able to differentiate between changing the graph spec / local data, which can be handled by just re-running the graph library, and changing graph dependencies, like adding another Commons tabular data file to the spec, which means re-parsing the page).

  • Before the save is complete, POST to graphoid (bikeshed the name later) the graph spec and all external resources it needs (this includes icons!, results of Wikidata queries, etc.)
  • graphing lambda service responds async with an image path (something in Swift)
  • The image path gets saved as part of the wikitext of the page

If it's async, the page needs to be saved without waiting for the response.

  • The graph spec gets saved in a new MCR slot

I would leave out MCR slots for now. They add complexity, it's a not fully finished system, and there doesn't seem to be a real need for it (sure having a huge graph definition inside wikitext is ugly, but in practice it will be a template anyway).

Yurik added a comment. Apr 15 2020, 8:33 PM

Thx @Tgr and @Milimetric. Reiterating some possibly overlooked points:

  • In addition to data blobs (WDQS, data pages, API calls, etc.), graphs could contain images (e.g. from Commons or the local wiki) and map image snapshots (generated by the maps snapshot service). See examples. If data is "prepackaged", some system would have to call all those services to assemble the needed data.
  • Newer Vega allows data loading as a result of a user action, or as a result of other data loading (e.g. if datasource A returns X, get datasource B)
  • MediaWiki PHP could try to parse the graph spec to get all data sources, and we could say that for the preview image the data must not be dynamic, but that still leaves images -- e.g. if the data has country codes, a graph could get the corresponding country flag by name, e.g. File:Icons-flag-<countrycode>.png.
  • Many edits are done by non-javascript clients (bots), so requiring the client to submit some data when saving might introduce too many bugs and data mismatches.
  • Vega is not allowed to get any data from outside of the WMF network (it uses a custom data/image loader to enforce that).
Yurik added a comment. Apr 15 2020, 8:36 PM

One more thing: the current custom-protocol:///?someparams=url-encoded scheme for the data sources was a workaround for an older Vega limitation. In Kibana, we used a much more successful approach:

"url": {
  "type": "data-page",
  "page": "NYC weather data.tab"
}

The conversion between the "URL" blob and the actual internal URL can easily be done by the custom data loader -- it has to validate these references anyway, and this form is also much more readable for users.
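A sketch of what that loader-side mapping might look like. The handler table and endpoint paths below are illustrative, not the real ones; the point is that each declarative `type` maps to exactly one validated URL builder.

```javascript
// Hypothetical sketch of the custom data loader's URL mapping: turn a
// readable {"type": ..., "page": ...} blob into an internal URL.
// Endpoint paths are illustrative stand-ins, not the production routes.
const TYPE_HANDLERS = {
  'data-page': ref =>
    'https://commons.wikimedia.org/wiki/Data:' + encodeURIComponent(ref.page),
  'wikiraw': ref =>
    'https://en.wikipedia.org/w/index.php?action=raw&title=' + encodeURIComponent(ref.title),
};

function resolveDataUrl(ref) {
  const handler = TYPE_HANDLERS[ref.type];
  // Unknown types are rejected outright -- the loader doubles as a whitelist.
  if (!handler) throw new Error('Unsupported data source type: ' + ref.type);
  return handler(ref);
}
```

Because unknown types throw, this mapping also acts as the validation step mentioned above: only explicitly supported source types ever become URLs.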

I'm updating the RFC after a very educational IRC meeting. The problem we're trying to solve is much clearer to me, and it has less to do with graphoid's troubles and more with how we reference data. Basically, something needs to fetch the data and something needs to render the graph. So we have two choices:

  1. The new service can do both fetch and render, and we have to prove that the loop of MW -> Graphing Service -> get data from MW can't possibly cascade and does not generate headaches for SRE.
  2. The new service only renders. This means MW has to fetch the data and send it to the service. Since MW can't run arbitrary JS graphing libraries without someone getting very angry, this also means we have to make a custom graphing spec on top of vega, that would define its data dependencies in a clear way that MW can read.

I will detail these choices in the description now (this may take me a while :))

Milimetric updated the task description. Apr 17 2020, 11:25 AM
Milimetric updated the task description. Apr 17 2020, 7:33 PM

Ok, I think I captured all that I've learned in the IRC meeting in the updated description (tweaked problem statement, reworked proposal). Huge thanks again to everyone who asked questions and gave me feedback. If any of you are following this, please check out the updated Proposed Solution section and the Open Questions.

@Milimetric thx for working on this! A few points that I would like further clarification on:

LOOPS (also mentioned by @akosiaris)
It is not possible in the current architecture to cause a cascading loop. Graphoid always does everything in two steps: (1) get the graph definition (spec) from MW using a hash, and (2) get the data and images needed by that definition. Vega could request images based on data. Older Vega could not request data based on other data, but this was a very frequent request for numerous use-cases, so it was added in the newer version.

MediaWiki waiting for external services
Current architecture does not allow for that. The MW parser does not wait for any remote service to complete its work. Graphoid's architecture was first and foremost designed to address that. Has this requirement changed? If so, it will allow a simpler architecture -- the parser will ask Graphoid to render an image, then save that image to swift, and embed the hash of that image into the resulting HTML.

Separating RENDERING and FETCHING
This still sounds like an impossible proposition - one has to go through all the rendering steps in order to compute the requirements. Could you outline the exact benefits of running the rendering twice?

@Yurik you know I love you, and your ideas and goals for the wikis, but I suspect you don't have time to re-read and catch up with all the details. Most of your questions are answered in the updated description, and I'm afraid to repeat because I want to move the conversation forward. I also noted your concerns and will try to surface that information more clearly in the docs. Now I'd like to hear from others about the new direction.

Tgr added a comment. Apr 28 2020, 2:43 PM

I feel this is stalled on lack of clear ops requirements. T211881: graphoid: Code stewardship request mentions a number of code quality issues, but the only architecture issue it brings up is service loops (which is incorrect as Yuri has pointed out). I think a clear problem statement would make this discussion much more focused. Inasmuch as the problem statement is "SREs plan to undeploy Graphoid", we really need their input on what the problems are as they see it.

That said:

  • option 1 couples page rendering with graph rendering and holds up page rendering while dependencies are retrieved. Good for dependency tracking (would be easy to properly put dependencies in the links tables where appropriate, for example), although these are typically cross-wiki dependencies and we don't support change tracking for those. Bad for robustness of page rendering. So that feels like a questionable trade-off. (Maybe not too bad but it would be nice to have numbers. How many extra requests would that result in on a typical page with graphs? How much would page preview / save performance decrease?)
  • option 2 is similar to the current architecture, except it uses the MediaWiki job queue to trigger Graphoid instead of letting the browser do it directly (or via Swift). That couples graph rendering with page rendering (although not as tightly as option 1), without any advantage over the current setup as far as I can see. Also, in this case the images can't be content-addressable (not by the image content, anyway), since the data dependencies are not available when the URL is rendered, only the graph spec itself. So the image the URL points to will need to change periodically, and Swift cannot be used as a routing layer the way we use it for normal thumbnails (which can also change periodically, but we know when that happens and purge old objects from Swift), unless some sort of garbage collection is implemented.
Yurik added a comment (edited). Apr 28 2020, 3:27 PM

If the job queue can call external services and wait for the rendered result, it would simplify the architecture a bit and allow for easier testing, and I am all for it. This assumes the job queue itself can store large data blobs, rather than just short strings.

Render steps

  • generate the JSON spec
  • create a hash from the JSON spec
  • generate HTML with the hash as the image ref, i.e. href="https://.../<hash>.png"
  • enqueue a new image-rendering job with both the JSON spec and the hash as parameters (this assumes the job queue can hold such big parameters)

Job queue steps

  • call Graphoid via POST, passing it the JSON spec, and wait for the resulting image
  • store image under hash in Swift (alternatively Graphoid itself can save the image into Swift)

Cons

  • The user may view the newly rendered page before job queue gets to the rendering job, so Swift won't have the image yet. This in theory could be mitigated by some smarter JS code that keeps re-requesting image for some time until it becomes available.
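That smarter JS could be a small capped polling loop. A sketch follows; the fetch function is injected so the retry logic is independent of the transport, and the attempt count and delay are arbitrary placeholder values, not tuned numbers.

```javascript
// Sketch of a client-side retry loop for an image that may not be in Swift
// yet: poll the URL until it answers, with a capped number of attempts.
async function waitForImage(url, fetchFn, { attempts = 10, delayMs = 2000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    const res = await fetchFn(url);
    if (res.ok) return res;                          // image is ready
    if (res.status !== 404) throw new Error('Unexpected status ' + res.status);
    await new Promise(r => setTimeout(r, delayMs));  // not rendered yet, wait
  }
  throw new Error('Image never became available: ' + url);
}
```

In a browser this would be wired to swap a placeholder `<img>` for the real one once the promise resolves; giving up after N attempts keeps a permanently failed render from polling forever.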
Base added a subscriber: Base. Apr 29 2020, 7:39 AM
Legoktm added a subscriber: Legoktm. Jul 7 2020, 8:27 AM

It's not explicitly clear to me in the proposal, but I'd like to advocate that the graph image URLs be permanent and not stop working if the graph changes. Currently the Kiwix team (which creates offline content bundles) is running into a problem: they fetch all the page content, then go through all the image URLs, only to find that the graph image URLs no longer work (it can easily take a few days between the fetch-all-page-content and process-image-URLs steps). If old URLs were guaranteed to keep working, this wouldn't be an issue.

Kelson added a subscriber: Kelson. Jul 7 2020, 8:34 AM

Quick update. I felt a bit like I ran into a wall with the discussion above, and I'm adjusting so I can make progress.

I'll be updating the RFC to reflect:

  • keep the high-level architecture largely as-is, with the exception of storing the graph definition (see below). There don't seem to be huge problems with the architecture itself, rather with some details that we can sidestep with newer versions of Vega and a fresh perspective.
  • clean up code, add unit tests, pin package versions, eliminate custom protocols, clean up support for the stateless functional endpoint.
  • de-prioritize refreshing of the graph (re-rendering and updating periodically). This can be done out-of-process and it was not the reason Graphoid was undeployed. We essentially can't do this well because our dependency tracking system is not granular enough or scalable enough. We can revisit once it is.

Before I update, I want to clear up how we can store the graph definition in a custom table. The alternatives, as I understand them, are to keep it in page-props and to cache it in the parser cache, but I think that's part of what caused the current mess and forced bad trade-offs. So, we'd have to create a new store (Where? Does it have to be mysql?) and then interact with the parser to move the spec there. At this point, we would not be able to make a CAS-style hash because we haven't fetched the external dependencies. So I wonder if there's a way to write something temporary in the page, like a "loading graph" image with a reference to the graph definition store. And then when the service fetches the external resources and renders the graph, it can update the new store with the hash (and URI of the stored image). Can this be swapped into the rendered page at that time? Via that Swift 404-mechanism? If not, I think this is the functionality we need and the reason this feels so twisted. I'd love to talk this out with whoever's interested, if nobody speaks up I'll crash the next Architecture office hours meeting.

I believe this addresses everything that was raised on the stewardship request without major changes in the rest of the system. So if we still don't think this is a good way forward, then let's try and either propose how we could improve it further or discuss how we can make those larger changes.

And yes, @Legoktm, I think it's a priority to be able to permalink to graph images from old revisions, and thank you for the additional use case and motivation, I'll add that when updating as well.

Krinkle changed the task status from Open to Stalled. Jul 29 2020, 8:40 PM
Krinkle triaged this task as Medium priority.
Krinkle added a subscriber: Krinkle.

Marking as stalled as this seems to be blocked on product decision(s).

Nuria removed a subscriber: Nuria. Jul 29 2020, 8:45 PM

Could anyone outline what specific product decisions are needed?

@kaldari: I cc-ed you on the email I sent to Carol about this. Basically, when this started, it seemed like Product had recognized the importance of Graphoid and was moving towards finding a permanent team that would maintain it. I was willing to fix and maintain it in the meantime. But after I started, all communication from Product stopped. I pinged on Slack and email and haven't received responses. So I am hesitant to continue, right now it feels like if I do manage to get this deployed I'd be the only one maintaining it. I'm happy to do that for a while, but not in the absence of any plan to adopt it officially. If there is no interest to adopt this at WMF, as a volunteer I'd prefer to start a separate site for collaborative data editing and graph generation.

What about:
a. Upgrade to Vega 5.0 and support Vega-Lite 4.0.
b. Use Vega's own server-side deployment on Node.js to generate PNG, SVG, or Canvas output for server-side rendering.
c. If the server side did not finish rendering, start client-side rendering and use Vega 5.15's Expression Interpreter to stay CSP-compliant.

I'm strongly against writing a wrapper, as such a self-maintained thing will definitely become a headache, just like Graphoid.

Milimetric updated the task description. Tue, Nov 24, 11:49 PM
Milimetric added a subscriber: Nuria.