=Motivation=
[[ https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic | Documenting the spread of COVID-19 ]] has exposed difficulties collaborating on data and rendering data visualizations. This latter problem is the motivation for this RFC; evidence can be found in T248707, T118783, and others. These front-end problems are related to the Graphoid service in a way that not everybody agrees is necessary, so I will lay out the need for both the Graph extension and Graphoid as I see it. This part of the RFC is more complicated and subjective, so discussion and updates are especially welcome.
To draw a graph we need data and a function that does the transformation into visual objects. On top of that we can layer interactivity so people can play with the new representation. Data is usually a few hundred KB of text, but it can get up to tens of MB or more if you're pulling in TopoJSON to render maps and getting multiple dimensions per geographic object. Here are some product research questions and my assumed answers until better data is available:
- how much data will our larger datasets have? Up to 10MB
- how many people choose to interact with a graph? Less than 1%
- how many different representations of data do we need? 99% of graphs will probably be maps, bar graphs, line graphs, and (::shudder::) pie charts
To show why these guesses matter, let's assume a graph rendering a world map like the linked article above. And let's walk through rendering this client-side-only or with the aid of a service. Let's assume 1 million daily views.
- Service pulls in 1MB of data, renders a 60KB image. 99% of clients get the 60KB image and 1% get 1MB of rendering code plus 1MB of data. Bandwidth to serve: 80 GB / client CPU time: 14 hours
- Without a service, each client would pull in 2MB and wait 2-60 seconds depending on their processing power; let's say 5 seconds on average. Bandwidth to serve: 2 TB / client CPU time: 2 months
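To make these back-of-envelope numbers reproducible, here is the arithmetic behind both scenarios as a small script (the 1M views, 60KB image, 2MB payload, 1% interaction rate, and 5-second render time are the assumptions stated above, not measurements):

```javascript
// Back-of-envelope check of the two scenarios above.
const views = 1e6;         // assumed daily views
const image = 60e3;        // bytes, rendered image
const codeAndData = 2e6;   // bytes, rendering code + data
const interactive = 0.01;  // fraction of clients that render client-side
const cpuPerRender = 5;    // seconds of client CPU per render, rough average

// With a service: 99% fetch the image, 1% fetch code + data and render.
const serviceBandwidth =
  views * ((1 - interactive) * image + interactive * codeAndData);
const serviceCpuHours = (views * interactive * cpuPerRender) / 3600;

// Client-side only: every client fetches code + data and renders.
const clientBandwidth = views * codeAndData;
const clientCpuDays = (views * cpuPerRender) / 86400;

console.log((serviceBandwidth / 1e9).toFixed(1) + ' GB, ' +
  serviceCpuHours.toFixed(1) + ' client CPU hours');
console.log((clientBandwidth / 1e12).toFixed(1) + ' TB, ' +
  clientCpuDays.toFixed(1) + ' client CPU days');
```

Running this yields roughly 80 GB / 14 hours for the service scenario and 2 TB / 2 months for the client-only one, matching the figures above.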
This makes the case for pre-rendering graphs on the server pretty easy to make: even if just one semi-popular article had a single graph, pre-rendering seems necessary. So, let's move on to the existing solution, Graphoid. This service tried to do everything that Vega is capable of, including pulling in and transforming data from external APIs. For this reason, it cut through our infrastructure in ways that left some scars. Problems are documented in the stewardship request, T211881, and mentioned in other RFCs and tasks (T119043, T98940, etc.). I hope this RFC can be a place where we centralize these discussions. These are my takeaways, but do feel free to edit and add your own:
- Stewardship request: T211881. The service integrates too deeply into page rendering. It created a situation where it has to know about the varnish caches, page property caches, parser caches, complicated timelines from when an edit happens to how it affects each cache, etc. And therefore folks working in those areas need to know about Graphoid as well. The next version should try to be as independent and functional as possible. To @akosiaris's point, this will also allow teams to work independently especially now that schedules are reduced. Also, another key point there is testability and performance monitoring. Graphoid calls out to mediawiki in a circular way so testing it in isolation can be very tricky.
- thoughts from @Bawolff: https://www.mediawiki.org/wiki/User:Bawolff/Reflections_on_graphs. My takeaway here is that we can and should support other rendering engines outside of Vega. Especially given the assumption about diversity of charts above.
- initial motivation from @Yurik: https://meta.wikimedia.org/wiki/User:Yurik/I_Dream_of_Content. This vision was always great. The solution should not sacrifice the general idea that interactive and visual content are important.
The problems with Graphoid are:
- supports old versions of Vega that require it to implement custom protocols
- hard to upgrade to node 10, currently on node 6 (main reason it's being undeployed)
- hard to maintain because of tight coupling with many and complex parts of our overall system
- hard to test because of calls back to mediawiki API
- noisy and annoying to SRE because of the HTTP errors from external data APIs
The requirements are:
- easy to maintain, makes SRE happy
- as separate from page rendering as possible
- takes as many parameters as possible up front, to limit calls back to Mediawiki
- updating when new data is available or graph is changed should not be made too complicated by the other simplifications
- support rendering latest vega and latest vega-lite from the start (being able to upgrade or add a new renderer should be easy)
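To make the "takes as many parameters as possible up front" requirement concrete, a request to such a service might look like the payload below. This is a sketch only; the field names (`renderer`, `spec`, `data`) are my invention, not a decided API:

```javascript
// Hypothetical sketch: validate an "everything up front" render request.
// Field names are assumptions, not an agreed interface.
function validateRenderRequest(req) {
  const errors = [];
  if (typeof req.renderer !== 'string' || !/^vega(-lite)?$/.test(req.renderer)) {
    errors.push('renderer must be "vega" or "vega-lite"');
  }
  if (typeof req.spec !== 'object' || req.spec === null) {
    errors.push('spec must be a JSON object');
  }
  // All data the spec needs arrives inline, so the service never has to
  // call back to Mediawiki or out to external APIs during rendering.
  if (typeof req.data !== 'object' || req.data === null) {
    errors.push('data must map dependency names to inline values');
  }
  return errors;
}
```

The point of the shape is isolation: a request either carries everything the render needs, or it is rejected before any work starts.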
**Affected Components**: Graphoid and future Graph extension versions
**Initial Implementation**: @milimetric (me) with @Pchelolo for code reviews and @akosiaris for "hey you're still doing it wrong"s (and the blessing of @Nuria, at least for now :))
**Code Steward**: @milimetric until I find a team that enthusiastically wants to support this going forward
(NOTE: there are two types of maps: geographic and infographic. In this context, we're talking about infographic maps, where the emphasis is on the data and not on the accuracy of boundaries or landmarks. Basically, maps that Vega can render, as opposed to Google-Maps-style maps, which are the domain of a different stack.)
=Existing plans and the Graph Extension=
As of this writing, the Graphoid service is due to be undeployed. This proposal does not seek to block that work, indeed it seems like a good idea to undeploy the current service, make the graph extension capable of working client-side-only, and support it with a more stand-alone service going forward. The graph extension currently supports only Vega and the new service will be flexible. This is where we should keep in touch and do the product work to understand the needs of our communities so we offer the right features.
Indeed, the decision between the following proposals is partly about making the service easy to maintain, and partly about what we can support in the Graph extension. Basically, to render a static image of any graph (vega or not), as Gergo sums it up in T249419#6050917, we need three subsystems:
1. fetch data needed for a first render
2. render the graph
3. store the image somewhere
I'm hearing consensus around storing the image in Swift, ideally at a path that's a hash of the graph spec, so we essentially get Content-Addressable Storage. So let's settle on that for storage and focus on fetch, render, and control of the process.
I only see one real choice, and surprisingly it's what the graph spec looks like: either it's opaque to MW or it's not. Bryan's idea from T249419#6029620 is great, but the core of it can, I think, be applied in all scenarios and is orthogonal to cache invalidation: we come up with a content-addressable hash, write it to the page, and let everything else happen asynchronously.
So, one proposal, with two options embedded:
=Proposed Solution=
- MW (via the graph extension) parses the page, sees a graph tag, and reads the graph spec
- [OPTION 1]: MW understands the spec and reads external dependencies
- [OPTION 2]: the spec is opaque to MW
- MW creates a content-addressable image url based on a hash of the spec and writes it to the output
- MW queues a jobqueue job to render the graph at that image url and keeps parsing
- [OPTION 1]: the jobqueue job records external dependencies in *links tables (or similar) to later invalidate pages cached with this graph tag. It then calls the service with the data and the spec
- [OPTION 2]: the jobqueue job calls the service with only the spec
- the service renders the graph, either [OPTION 1] using the passed-in data or [OPTION 2] calling out for the data itself
- the jobqueue job stores the rendered image in Swift, at the image url that MW created
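The jobqueue flow above can be sketched as a single function with the two options side by side. Everything here is hypothetical (there are no such MW job classes or service clients); collaborators are injected so the difference between the options stays visible:

```javascript
// Hypothetical sketch of the render job. `mw`, `renderService`, and `swift`
// are injected stand-ins, not real APIs.
async function runRenderJob({ spec, imageUrl, option, mw, renderService, swift }) {
  if (option === 1) {
    // OPTION 1: the MW-side job resolves external dependencies itself and
    // records them (e.g. in *links tables) for later cache invalidation.
    const deps = await mw.resolveDependencies(spec);
    mw.recordDependencies(imageUrl, deps);
    const image = await renderService.render({ spec, data: deps });
    await swift.put(imageUrl, image);
  } else {
    // OPTION 2: the spec is opaque to MW; the service fetches data itself,
    // which is what creates the MW -> service -> MW loop discussed below.
    const image = await renderService.render({ spec });
    await swift.put(imageUrl, image);
  }
}
```

Either way, the job ends by writing to the content-addressable url MW already put in the page output, so page rendering never waits on the service.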
OPTION 1 means we wrap graph specs and declare statically what external resources the graph needs. These can be image URLs, data, etc. The biggest benefit is that we can easily let MW track dependencies for cache invalidation. The downside is that we can only support a subset of what data visualization libraries like Vega can do.
OPTION 2 means the service needs to call out for the data. This means we have a loop: MW calls the service and the service calls back to MW. Technically it's calling back just to get external resources, but one of those resources could be data from a page that has another graph on it, and I can see a malicious user being able to make a real loop there. We could prevent this by restricting which external dependencies are allowed, but I think a clear requirement here is a very good proof, to make SRE happy, that this is not a real loop. The benefit, of course, is that MW and the service can be much simpler and end users get the most powerful possible graph spec. The other downside is, of course, the lack of dependency tracking. The service could implement its own, but that seems wasteful unless the dependency system is refactored.
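One way to make the "restrict which external dependencies are allowed" idea concrete is a host allowlist checked before the service fetches anything. The list contents here are illustrative assumptions; the real list would have to contain only endpoints whose responses can never trigger another graph render:

```javascript
// Hypothetical sketch: an allowlist gate for the service's outbound fetches.
// The hosts listed are examples, not a vetted set.
const ALLOWED_HOSTS = new Set([
  'commons.wikimedia.org', // e.g. tabular/map data pages
  'wikidata.org',
]);

function isAllowedDependency(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl); // WHATWG URL parser, built into Node
  } catch (e) {
    return false; // relative or malformed URLs are rejected outright
  }
  return url.protocol === 'https:' && ALLOWED_HOSTS.has(url.hostname);
}
```

An allowlist alone does not constitute the proof SRE would want, since an allowed page could still embed a graph; it only narrows the surface that the proof has to cover.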
Implications for the graph extension are minimal with OPTION 2. With OPTION 1, it would have to concede some flexibility and educate users on the new spec so they can write it. Or, of course, provide a nice editor that does it for them. Some thoughts on the graph extension in general:
- UX of editors. Graph definitions should be easy to edit and validate client-side. Building templates on top of common types of graphs should be easy. MW templates don't seem like a great solution for this, and it's an open question for me whether editors will inevitably use MW templates for graphs or we can prevent it altogether.
- It should be easy to copy and render old versions of graph definitions along with the data they referenced, if the APIs or URLs used allow it. This is where we may need to open up discussion about where we store these artifacts (MCR slots seem to make sense, but I gather they're not ready for this use case).
Tagging other people that might be interested in this discussion:
- @kaldari is working on a bot to update some of the most needed and critical graphs; that bot could use this service before even the graph extension does
- @Tnegrin is project managing the immediate response to graphoid/graph extension problems
- @Abit is leading a weekly sync
- @Seddon, who is working on making the graph extension client-side-only