Motivation
Documenting the spread of COVID-19 has exposed difficulties both in collaborating on data and in rendering data visualizations. The latter problem is the motivation for this RFC; evidence can be found in T248707, T118783, and others. These front-end problems are related to the Graphoid service in a way that not everybody agrees is necessary, so I will lay out the need for both the Graph extension and Graphoid as I see it. This part of the RFC is more complicated and subjective, so discussion and updates are especially welcome.
To draw a graph we need data and a function that transforms that data into visual objects. On top of that we can layer interactivity so people can play with the new representation. Data is usually a few hundred KB of text, but it can reach tens of MB or more if you're pulling in TopoJSON to render maps and getting multiple dimensions per geographic object. Here are some product research questions and my assumed answers until better data is available:
- how much data will our larger datasets have? Up to 10MB
- how many people choose to interact with a graph? Less than 1%
- how many different representations of data do we need? 99% of graphs will probably be maps, bar graphs, line graphs, and (::shudder::) pie charts
To show why these guesses matter, let's take a graph rendering a world map like the one in the linked article above, and walk through rendering it either client-side-only or with the aid of a service. Let's assume 1 million daily views.
- Service pulls in 1MB of data and renders a 60KB image. 99% of clients get the 60KB image and 1% get 1MB of rendering code plus 1MB of data. Bandwidth to serve: ~80 GB / client CPU time: ~14 hours
- Without a service, each client would pull in 2MB and wait 2-60 seconds depending on their processing power; let's say 5 seconds on average. Bandwidth to serve: ~2 TB / client CPU time: ~2 months
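To make these estimates easy to re-run with different guesses, here's a quick back-of-the-envelope script. Every input is one of the assumptions above (decimal units), not a measurement:

```typescript
// Back-of-the-envelope check of the estimates above. All inputs are the
// assumptions stated in this RFC, not measurements; decimal units throughout.
const DAILY_VIEWS = 1_000_000;
const IMAGE_KB = 60;          // pre-rendered image served to most clients
const CLIENT_PAYLOAD_MB = 2;  // 1 MB rendering code + 1 MB data
const AVG_RENDER_SECONDS = 5; // assumed average client-side render time

const interactive = DAILY_VIEWS * 0.01; // the ~1% who interact with the graph

// With a pre-rendering service: 99% get the image, 1% render client-side.
const serviceGB = (DAILY_VIEWS * 0.99 * IMAGE_KB) / 1e6
                + (interactive * CLIENT_PAYLOAD_MB) / 1e3;          // ≈ 79.4 GB
const serviceCpuHours = (interactive * AVG_RENDER_SECONDS) / 3600;  // ≈ 14 hours

// Client-side-only: every view pulls code + data and renders locally.
const clientTB = (DAILY_VIEWS * CLIENT_PAYLOAD_MB) / 1e6;           // 2 TB
const clientCpuDays = (DAILY_VIEWS * AVG_RENDER_SECONDS) / 86400;   // ≈ 58 days

console.log({ serviceGB, serviceCpuHours, clientTB, clientCpuDays });
```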
So even if just one semi-popular article had a single graph, pre-rendering seems necessary. With that, let's move on to the existing solution, Graphoid. This service tried to do everything that Vega is capable of, including pulling in and transforming data from external APIs. For this reason, it cut through our infrastructure in ways that left some scars. Problems are documented in the stewardship request, T211881, and mentioned in other RFCs and tasks (T119043, T98940, etc.). I hope this RFC can be a place where we centralize these discussions. These are my takeaways, but do feel free to edit and add your own:
- Stewardship request: T211881. The service integrates too deeply into page rendering. It created a situation where it has to know about the Varnish caches, page property caches, parser caches, the complicated timelines from when an edit happens to how it affects each cache, etc., and therefore folks working in those areas need to know about Graphoid as well. The next version should try to be as independent and functional as possible. To @akosiaris's point, this will also allow teams to work independently, especially now that schedules are reduced. Another key point there is testability and performance monitoring: Graphoid calls out to MediaWiki in a circular way, so testing it in isolation can be tricky.
- thoughts from @Bawolff: https://www.mediawiki.org/wiki/User:Bawolff/Reflections_on_graphs. My takeaway here is that we can and should support other rendering engines outside of Vega. Especially given the assumption about diversity of charts above.
- initial motivation from @Yurik: https://meta.wikimedia.org/wiki/User:Yurik/I_Dream_of_Content. This vision was always great. The solution should not sacrifice the general idea that interactive and visual content are important.
The problems with Graphoid are:
- supports old versions of Vega that require it to implement custom protocols
- hard to upgrade to Node 10, currently on Node 6 (the main reason it's being undeployed)
- hard to maintain because of tight coupling with many and complex parts of our overall system
- parsing: storing the graph definition in page properties violates what page_props were designed for and complicates both parsing a page with a graph tag and Graphoid's job of fetching the graph definition
- caching and invalidation: see patches trying to fix this in referenced issues
- hard to test because of calls back to mediawiki API
- noisy and annoying to SRE because of the HTTP errors from external data APIs
The requirements are:
- easy to maintain, makes SRE happy
- as separated from page rendering as possible
- takes as many parameters as possible up front, to limit calls back to MediaWiki
- updating when new data is available or the graph definition changes should not be made too complicated by the other simplifications
- support rendering the latest Vega and Vega-Lite from the start (upgrading or adding a new renderer should be easy)
Affected Components: Graphoid and future Graph extension versions
Initial Implementation: @Milimetric (me) with @Pchelolo for code reviews and @akosiaris for "hey you're still doing it wrong"s (and the blessing of @Nuria, at least for now :))
Code Steward: @Milimetric until I find a team that enthusiastically wants to support this going forward
(NOTE: there are two types of maps: geographic and infographic. In this context, we're talking about infographic maps, where the emphasis is on the data and not on the accuracy of boundaries or landmarks. Basically, maps that Vega can render, as opposed to Google Maps-style maps, which are the domain of a different stack.)
Existing plans and the Graph Extension
As of this writing, the Graphoid service is due to be undeployed. This proposal does not seek to block that work; indeed it seems like a good idea to undeploy the current service, make the Graph extension capable of working client-side-only, and support it with a more stand-alone service going forward. The Graph extension currently supports only Vega, while the new service will be flexible. This is where we should keep in touch and do the product work to understand the needs of our communities, so we offer the right features.
Indeed, the decision between the options below is partly about making the service easy to maintain, and partly about what we can support in the Graph extension. Basically, to render a static image of any graph (Vega or not), as Gergo sums it up in T249419#6050917, we need three subsystems:
- fetch external dependencies (data and images like icons)
- render the graph
- store the image somewhere
I'm hearing consensus around storing the image in Swift, ideally at a path derived from a hash of the graph spec, so we essentially get content-addressable storage. That sounds great to me, so let's focus on fetch, render, and control of the process.
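To make the content-addressable idea concrete, here's a minimal sketch of deriving a Swift path from the spec; the hash choice and path layout are my own illustration, not something that's been decided:

```typescript
import { createHash } from 'crypto';

// Minimal sketch: derive a content-addressable Swift path from a graph spec.
// Hash choice and path layout are illustrative assumptions, not decided.
function graphImagePath(graphSpec: object): string {
  // Caveat: a real implementation needs a canonical serialization so that
  // equivalent specs (differing key order, whitespace) hash identically.
  const canonical = JSON.stringify(graphSpec);
  const hash = createHash('sha256').update(canonical).digest('hex');
  return `/graphs/${hash.slice(0, 2)}/${hash}.png`;
}
```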
I think we only have one real choice to make, and surprisingly it's what the graph spec looks like: either it's opaque to MW or it's not. Bryan's idea to write a CAS image URL into the page to decouple MW parsing from graph rendering (T249419#6029620) is great, but the core of it can be applied in all scenarios. I'll use it below and also show that it's orthogonal to cache invalidation. Like everything else, the way we handle cache invalidation really depends on whether MW can read the graph spec.
So, one proposal, with two options embedded:
Proposed Solution
- MW (via the Graph extension) parses the page, sees a graph tag, reads the graph spec
- [OPTION 1]: MW understands the spec, reads external dependencies
- [OPTION 2]: the spec is opaque to MW
- MW creates a content-addressable image URL based on a hash of the spec and writes it to the output
- MW queues a jobqueue job to render the graph at that image URL and keeps parsing
- [OPTION 1]: the jobqueue job records external dependencies in *links tables (or similar) to later invalidate pages cached with this graph tag. It then calls the service with data and spec
- [OPTION 2]: the jobqueue job calls the service with only the spec
- service renders the graph, either [OPTION 1] using the passed-in data or [OPTION 2] calling out for data itself
- the jobqueue job stores the rendered image in Swift, at the image URL that MW created (a sketch of this job follows)
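As a sketch of how that jobqueue job might look under each option; the service endpoint, payload shapes, and helper names here are all hypothetical:

```typescript
// Hypothetical sketch of the jobqueue job described above. The service
// endpoint, payload shapes, and helpers are illustrative, not a real API.
declare function fetchDeclaredDependencies(spec: object): Promise<object>;
declare function swiftPut(path: string, image: ArrayBuffer): Promise<void>;

async function renderGraphJob(spec: object, imagePath: string, option: 1 | 2) {
  let payload: object;
  if (option === 1) {
    // OPTION 1: MW understood the spec; dependencies were already recorded
    // in *links tables at parse time, and we pass the data along.
    payload = { spec, data: await fetchDeclaredDependencies(spec) };
  } else {
    // OPTION 2: the spec is opaque; the service will fetch data itself.
    payload = { spec };
  }

  const res = await fetch('https://graph-renderer.internal/render', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(payload),
  });

  // Store the rendered image at the content-addressable URL that MW
  // already wrote into the parsed page output.
  await swiftPut(imagePath, await res.arrayBuffer());
}
```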
OPTION 1 means we wrap graph specs and statically declare what external resources the graph needs. These can be image URLs, data, etc. The biggest benefit is that we can easily let MW track dependencies for cache invalidation. The downside is that we can only support a subset of what data visualization libraries like Vega can do.
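For illustration only, a wrapper along these lines might be enough; every field name here is hypothetical:

```typescript
// One possible (entirely hypothetical) shape for an OPTION 1 wrapper spec:
// external dependencies are declared statically, outside the inner spec,
// so MW can track them for cache invalidation without understanding Vega.
interface WrappedGraphSpec {
  version: 1;
  renderer: 'vega' | 'vega-lite'; // other engines could be added later
  dependencies: {
    data: string[];   // e.g. tabular data page titles or API URLs
    images: string[]; // e.g. icon or basemap URLs
  };
  spec: object;       // the inner Vega/Vega-Lite spec, opaque at this layer
}
```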
OPTION 2 means the service needs to call out for the data. This means we have a loop of MW calling the service and the service calling back to MW. Technically it's calling back just to get external resources, but one of those external resources could be data from a page that might have another graph on it, and I can see a malicious user being able to make a loop there. We could prevent this by restricting what the allowed external dependencies are, but I think a clear requirement here is coming up with a very good proof, to make SRE happy, that it's not a real loop. The benefit, of course, is that MW and the service can be much simpler and end users get the most powerful possible graph spec. The other downside is, of course, the lack of dependency tracking. The service could implement its own, but that seems wasteful unless the dependency system is refactored.
While going with OPTION 1 means committing to supporting our own wrapper spec, from the point of view of the service the endpoints needed to support either option are very simple. The only architectural change would be to stop storing the graph spec in page_props. Beyond that, I would clean up the code, upgrade dependencies, and add tests. So we could implement endpoints supporting both options and delay the choice until the Graph extension is fully adopted by a team. This is what I'll be doing unless someone comments with strong objections.
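To show how simple the service side could stay, here's an Express-style sketch of endpoints covering both options; the routes, payload shapes, and helpers are hypothetical:

```typescript
import express from 'express';

// Hypothetical rendering and data-fetching helpers.
declare function render(spec: object, data: object): Promise<Buffer>;
declare function fetchExternalData(spec: object): Promise<object>;

// Sketch of endpoints covering both options; routes and payload shapes
// are illustrative, not a committed API.
const app = express();
app.use(express.json({ limit: '20mb' })); // datasets can reach tens of MB

// OPTION 1: the caller (the jobqueue job) supplies spec and data up front.
app.post('/render-with-data', async (req, res) => {
  const { spec, data } = req.body;
  res.type('png').send(await render(spec, data));
});

// OPTION 2: the caller supplies only the spec; the service fetches data.
app.post('/render', async (req, res) => {
  const { spec } = req.body;
  res.type('png').send(await render(spec, await fetchExternalData(spec)));
});
```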
Implications for the Graph extension are minimal with OPTION 2. With OPTION 1, it would have to concede some flexibility and educate users on the new spec so they can write it, or, of course, provide a nice editor that does it for them. Some thoughts on the Graph extension in general:
- UX of editors. Graph definitions should be easy to edit and validate client-side. Building templates on top of common types of graphs should be easy. MW templates don't seem like a great solution for this, and it's an open question for me whether editors will inevitably use MW templates for graphs or whether we can prevent that altogether.
- It should be easy to copy and render old versions of graph definitions along with the data they referenced, if the APIs or URLs used allow it. This is where we may need to open up discussion about where we store these artifacts (MCR slots seem to make sense but I gather they're not ready for this use case).
Open Questions
- Should we design a wrapper spec that can wrap any graph spec and statically expose its external dependencies (OPTION 1 above)?
- Where should we store the graph spec, given that it can be generated from templates? Should we disallow generating it from templates (can we even do that)?
- Security considerations for private wikis (see task for previous Graphoid implementation: T90526)
Tagging other people who might be interested in this discussion:
- @kaldari is working on a bot to update some of the most needed and critical graphs; that bot could use this service even before the Graph extension does
- @Tnegrin is project managing the immediate response to graphoid/graph extension problems
- @Abit is leading a weekly sync
- @Seddon is working on making the Graph extension client-side-only