
graphoid: Code stewardship request
Closed, Resolved · Public

Description

Intro

The Graphoid service [1] was introduced into the infrastructure a few years ago (3 years and 8 months, per f56e53bb2d). It is meant as a fallback mechanism for clients that don't support JavaScript, or don't support all of the newer JavaScript and HTML5 features that the vega library [2] requires. As a fallback mechanism it generates a PNG out of a graph that would normally be served as a set of HTML elements. It is also very useful for keeping the amount of data a client needs to download from our servers to a minimum, as it allows displaying a graph as a PNG and avoids downloading the entire vega library.

Issues

During the migration of the service to the kubernetes-based deployment pipeline, a number of issues became evident that hindered and effectively paused the migration.

The issues identified were:

An unorthodox architecture of the service's API.

See https://www.mediawiki.org/api/rest_v1/#!/Page_content/get_page_graph_png_title_revision_graph_id for the specification.

The graphoid service fetches the graph from the mediawiki API, essentially using mediawiki pages as a data store. However, to identify the required graph it needs to know the graph's identity, which is essentially a hash of the graph.

That raises a question, however. If the caller of the graphoid service API already knows the hash of the graph, they either got it by talking to mediawiki, or they calculated it themselves (i.e. they already have the entire graph). If they already have the entire graph, why not POST it to the service and obtain the resulting PNG directly? Things become even more convoluted if the caller is mediawiki itself (to my knowledge that is actually the case), in which case having mediawiki instruct another service to create requests back to it (to the API endpoint, actually) causes a cascade of requests hitting the mediawiki API. Up to now this hasn't caused an outage, in my opinion simply because of the low traffic graphoid receives.
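To illustrate the point, compare the two request shapes below (JavaScript, assuming Node 18+'s global fetch; the GET path follows the REST spec linked above, while the POST endpoint is a hypothetical placeholder, roughly the shape the v2 API mentioned later in this task would take):

async function fetchGraphPngToday(title, revision, graphId) {
  // Current shape: the caller must already know the graph's hash (graphId),
  // which graphoid then uses to fetch the very same graph back from the mediawiki API.
  const url = `https://en.wikipedia.org/api/rest_v1/page/graph/png/${title}/${revision}/${graphId}`;
  return Buffer.from(await (await fetch(url)).arrayBuffer());
}

async function fetchGraphPngByPost(graphSpec) {
  // If the caller already holds the full spec, it could simply POST it and get the PNG back.
  // The endpoint name below is made up for illustration.
  const res = await fetch('https://graphoid.example/v2/png', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(graphSpec),
  });
  return Buffer.from(await res.arrayBuffer());
}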

This tight coupling of the two components (graphoid + mediawiki) makes benchmarking of the service unnecessarily risky and difficult. Having to obtain a number of titles+revisions+hashes beforehand is an unnecessary nuisance. Add in the risk that benchmarking the graphoid service might cause undue load on the mediawiki API, and the operating parameters of the service are essentially unknown, meaning that any kind of support for it by anyone can only be best-effort.

Connections to the public API endpoints

Graphoid does not use our discovery endpoints (e.g. api-rw.discovery.wmnet) but rather talks directly to the public LVS endpoints (en.wikipedia.org). That has a number of consequences:

  • It makes it hard to point graphoid to another mediawiki cluster/DC/availability zone, complicating operations.
  • Potentially populates the public edge caches with content that may or may not be requested
  • Is prone to cache invalidation issues.

Unconventional protocol schemes

The need to talk to the mediawiki API while also supporting data hosted in it has created an unwieldy number of network protocol schemes. For reasons yet unknown to me, HTTPS URLs are (were?) not supported in trusted graphs by vega, nor were protocol-relative URLs (e.g. //url_to_whatever). Protocol-relative URLs have been considered an anti-pattern since 2014 anyway [3], but for some reason support for them was implemented.

An example can be seen at https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/graphoid/+/refs/heads/master/config.dev.yaml#72 where two extra protocols are defined (wikiuploadraw and wikidatasparql). The only documentation for these is at https://www.mediawiki.org/wiki/Extension:Graph#External_data, which does not explain what they are or why they exist. Looking at the code [4] reveals that a whole set of additional schemes were originally conceived but are not documented (and probably not used).
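For illustration only, the scheme handling boils down to something like the sketch below; the mapping targets here are assumptions on my part, not the real configuration (which lives in VegaWrapper.js [4]):

// Hypothetical mapping of custom graph protocol schemes to real HTTPS endpoints.
// The actual list and targets are defined in mw-graph-shared's VegaWrapper [4].
const SCHEMES = {
  wikiuploadraw: 'https://upload.wikimedia.org/',              // assumed target
  wikidatasparql: 'https://query.wikidata.org/sparql?query=',  // assumed target
};

function resolveGraphUrl(url) {
  const match = /^([a-z]+):\/\/(.*)$/i.exec(url);
  if (!match) throw new Error('unparseable url: ' + url);
  const [, scheme, rest] = match;
  if (scheme === 'http' || scheme === 'https') return url;
  if (SCHEMES[scheme]) return SCHEMES[scheme] + rest;
  throw new Error('unsupported scheme: ' + scheme);
}

// e.g. resolveGraphUrl('wikidatasparql:///?query=SELECT...') -> a query.wikidata.org URL (in this sketch)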

Dependencies are not versioned

Not all dependencies in package.json are version-pinned [5]. This, combined with the fact that the vega library had undergone significant changes, meant that the first efforts to add graphoid to the service pipeline, in May 2018, failed. In the end the npm shrinkwrap approach was followed in order to succeed, but it's not exactly great.
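As a rough illustration (this is not Graphoid's actual tooling), a few lines of Node are enough to flag the loose specifiers in a package.json:

// Flags dependency specifiers that are not pinned to an exact version,
// e.g. "^1.2.3", "~1.2.3", "*" or git/GitHub URLs tracking a branch.
const fs = require('fs');

const pkg = JSON.parse(fs.readFileSync('package.json', 'utf8'));
const deps = Object.assign({}, pkg.dependencies, pkg.devDependencies);
const exact = /^\d+\.\d+\.\d+$/;

for (const [name, spec] of Object.entries(deps)) {
  if (!exact.test(spec)) {
    console.log(`not pinned: ${name} -> ${spec}`);
  }
}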

Rubric

  • Current maintainer

Unknown. @Yurik is the original author and was the maintainer in the past.

  • Number, severity and age of known and confirmed security issues

0

  • Was it a cause of production outages or incidents? List them.

No, but with such a low rate of requests that's explainable

  • Does it have sufficient hardware resources for now and the near future (to take into account expected usage growth)?

No. The scb cluster where graphoid currently resides is already deprecated and slated to be decommissioned. Services on that cluster are meant to be deployed on the deployment pipeline and kubernetes.

  • Is it a frequent cause of monitoring alerts that need action, and are they addressed timely and appropriately?

No

  • When it was first deployed to Wikimedia production

2015

  • Usage statistics based on audiences served

https://grafana.wikimedia.org/d/000000068/restbase?panelId=15&fullscreen&orgId=1 points to < 1 rps; https://grafana.wikimedia.org/d/000000021/service-graphoid?panelId=11&fullscreen&orgId=1 points to more, ~50 rps (that's excluding monitoring).

  • Changes committed in the last 1, 3, 6 and 12 months

1 month => 1
3 months => 3
6 months => 3
12 months => 6

  • Reliance on outdated platforms (e.g. operating systems)

The service relies on old versions of the vega [2] library, namely vega 1 (1.5.3) and vega 2 (2.6.4), as well as a few vega plugins. Vega 3 is out and is being evaluated at T172938, which hasn't seen any activity in almost a year.

  • Number of developers who committed code in the last 1, 3, 6, and 12 months
2: me (@akosiaris) and @mobrovac, for the process of migrating the service to the deployment pipeline. No functionality was added or removed, no bugs fixed, nor anything else.
  • Number and age of open patches

1 and it's a deployment pipeline change

  • Number and age of open bugs

34 tasks at https://phabricator.wikimedia.org/project/board/1191/

  • Number of known dependencies

None.

  • Is there a replacement/alternative for the feature? Is there a plan for replacement?

Graphoid, per [1], is already a fallback mechanism that was meant to address old and non-JavaScript-supporting browsers. To my understanding, the plan was always to use it as a stopgap while the browser population caught up with the new features required by vega [2], and then remove it from the infrastructure.

  • Submitter's recommendation (what do you propose be done?)

My personal take would be to investigate whether the service still serves a valuable purpose and, based on the numbers, decide whether to remove it from the infrastructure or not. If it has served its function, that's great. If it's still useful and we do decide to keep it around, a significant amount of work will have to be done to address the issues I've highlighted above. Whether that is worth it or not is a good question.

[1] https://www.mediawiki.org/wiki/Extension:Graph#Graphoid_service
[2] http://vega.github.io/
[3] https://www.paulirish.com/2010/the-protocol-relative-url/
[4] https://github.com/nyurik/mw-graph-shared/blob/master/src/VegaWrapper.js#L193
[5] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/graphoid/+/refs/heads/master/package.json

Related Objects

Event Timeline


In any case, and at the risk of repeating myself, the service is under a code stewardship request because it does not currently have a maintainer. If we want to keep it around, we need first and foremost a maintainer.

@akosiaris From what I gathered in this discussion, the course of action on this should be to find a Code Steward for graphoid.

Yes, agreed.

Funding/resources aside, is there a logical home for graphoid?

I don't think I can answer that :(

Given the SCB deprecation and the incoming workload of re-architecting this to work on Kubernetes for whoever takes on Graphoid maintenance, maybe it would be useful to outline what that work would be?

Definitely.

The graphoid repo itself doesn't really seem that complicated, given the only thing it seems to do is render vega data models. There is not much code in the repo either.

Please let me know if I'm misunderstanding something, but the trickier issue seems to be that vega will do data fetching for you to fully expand certain data models depending on how they are configured (1)? If this is the case then that seems to be a core feature of the service.

I am not sure I understand the question. Are you asking if being able to fetch graph/data model specs from a URL is a problem or a feature (or even both :) )?

  • What is the solution for a service that needs to make external requests when deployed to kubernetes?

Define external. External as in outside the service or outside the WMF infrastructure? Note that kubernetes doesn't really change much the way requests are being made, aside from a few hardening details. In kubernetes we block all outgoing requests and only allow specific destinations. If the aforementioned destinations are in the WMF infrastructure then we add them to the whitelist. Specifically for mediawiki, we want the service to use our internal discovery URLs (https://api-rw.discovery.wmnet) with the corresponding HTTP Host: header for the project being accessed, for easier operational actions (being able to depool an entire DC for a service) and to avoid polluting the edge caches by mistake. If the requests are outside of the WMF infrastructure they need to go via a proxy (also to be added to the whitelist). This is to decrease the impact of a compromised service.
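To make the mediawiki part concrete, here is a minimal sketch of what such a request could look like from a Node service; the TLS/SNI details are assumptions about our setup, not a statement of how it must be done:

const https = require('https');

// Talk to the internal discovery endpoint, while telling mediawiki (via the Host
// header) and TLS (via SNI) which project is actually meant.
const req = https.request({
  host: 'api-rw.discovery.wmnet',
  path: '/w/api.php?action=query&meta=siteinfo&format=json',
  headers: { Host: 'en.wikipedia.org' },
  servername: 'en.wikipedia.org', // assumption: certificate validation should match the project name
}, res => {
  let body = '';
  res.on('data', chunk => { body += chunk; });
  res.on('end', () => console.log(JSON.parse(body).query.general.sitename));
});
req.on('error', err => console.error(err));
req.end();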

By the way, any kind of changes/work would probably be unrelated to kubernetes. The execution patterns (the extra hardening described above aside) are more or less the same as far as the software is concerned.

  • Can this be solved at the infrastructure level and configuration of the service, or does it need to be resolved by changing the core feature of the service itself?

Again I am not sure I understand (probably because I am lacking some part of the picture). What kind of configuration changes would you see on the service that solves the problem?

I personally fail to see how this would be refactored; the only thing that comes to mind is parsing the vega model in PHP before rendering the fallback image and expanding any data URLs to have the actual data before printing a URL in the HTML, but that is a terrible use of PHP, and then the service itself would only render some vega models.

I don't have a great answer either (again probably because I am missing parts of the picture). Assuming time, resources and the capacity to change it, I would probably try to slowly refactor graphoid to become more of a lambda function that just consumes data models via POST and returns the images, probably storing those images in Swift, with mediawiki (or any other interested client in the future) just POSTing via the jobqueue (and that's already an optimization to avoid timeouts).
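To sketch that direction (and only as a sketch: the endpoint name, the omitted Swift storage step and the error handling are placeholders I made up, not a design):

const express = require('express');
const vega = require('vega');

const app = express();
app.use(express.json({ limit: '1mb' }));

// A client (e.g. mediawiki via the jobqueue) POSTs a full Vega spec and gets a PNG back.
// The resulting PNG could additionally be written to Swift here; that step is omitted.
app.post('/render', async (req, res) => {
  try {
    const view = new vega.View(vega.parse(req.body), { renderer: 'none' });
    const canvas = await view.toCanvas(); // needs the optional 'canvas' package
    res.type('image/png').send(canvas.toBuffer('image/png'));
  } catch (err) {
    res.status(400).json({ error: String(err) });
  }
});

app.listen(8080);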

But that's the greater, overall, not fully fleshed out architectural vision I have in mind, and there are probably a lot of details and the devil is in them. I expect there is functionality that would either need to be dropped or refactored in this vision.

That being said, there seems to be a lot of other work, even without changing the architecture, that the maintaining team would need to do: namely upgrading the vega library, sorting out and upgrading dependencies, using well-pinned versions in package.json and not pulls from github master branches, and fixing the public API endpoints problem.

Sorry if I'm missing something, but understanding what the future maintainers are going to be asked to do with the project seems quite relevant to considering on taking on its maintenance.

Fully agreed. I'd like to help as much as possible to figure that out, but again I don't have the full picture either.

The graphoid repo itself doesn't really seem that complicated, given the only thing it seems to do is render vega data models. There is not much code in the repo either.

Please let me know if I'm misunderstanding something, but the trickier issue seems to be that vega will do data fetching for you to fully expand certain data models depending on how they are configured (1)? If this is the case then that seems to be a core feature of the service.

I am not sure I understand the question. Are you asking if being able to fetch graph/data model specs from a URL is a problem or a feature (or even both :) )?

I'm stating that graphoid via vega (1) and mw-graph-shared (2), will do fetches to fill up the vega data model when they have urls (3), which is a core feature of graphs on the wiki, and as such the server side renderer uses it.

My question was if this is the deal-breaker with the kubernetes migration, or if it is something else (querying mediawiki with the graph id to get the full vega data model, or something else).

  • What is the solution for a service that needs to make external requests when deployed to kubernetes?

Define external. External as in outside the service or outside the WMF infrastructure? Note that kubernetes doesn't really change much the way requests are being made, aside from a few hardening details. In kubernetes we block all outgoing requests and only allow specific destinations. If the aforementioned destinations are in the WMF infrastructure then we add them to the whitelist. Specifically for mediawiki, we want the service to use our internal discovery URLs (https://api-rw.discovery.wmnet) with the corresponding HTTP Host: header for the project being accessed, for easier operational actions (being able to depool an entire DC for a service) and to avoid polluting the edge caches by mistake. If the requests are outside of the WMF infrastructure they need to go via a proxy (also to be added to the whitelist). This is to decrease the impact of a compromised service.

I guess it would then need to be defined whether the requests to these domains are to be treated as internal or external. I believe the exact same requests can be done by JavaScript in the browsers.

Thanks for clarifying the discovery URL part. If those requests were to be treated as internal, it seems doable by modifying mw-graph-shared.

By the way, any kind of changes/work would probably be unrelated to kubernetes. The execution patterns (the extra hardening described above aside) are more or less the same as far as the software is concerned.

They could be, but it seems like the problems and the motivation for this are the issues uncovered by the constraints of our kubernetes deployment. So while theoretically unrelated, they seem very much related in this case.

I personally fail to see how this would be refactored; the only thing that comes to mind is parsing the vega model in PHP before rendering the fallback image and expanding any data URLs to have the actual data before printing a URL in the HTML, but that is a terrible use of PHP, and then the service itself would only render some vega models.

I don't have a great answer either (again probably because I am missing parts of the picture). Assuming time, resources and the capacity to change it, I would probably try to slowly refactor graphoid to become more of a lambda function that just consumes data models via POST and returns the images, probably storing those images in Swift, with mediawiki (or any other interested client in the future) just POSTing via the jobqueue (and that's already an optimization to avoid timeouts).

But that's the greater, overall, not fully fleshed out architectural vision I have in mind, and there are probably a lot of details and the devil is in them. I expect there is functionality that would either need to be dropped or refactored in this vision.

That being said, there seems to be a lot of other work, even without changing the architecture, that the maintaining team would need to do: namely upgrading the vega library, sorting out and upgrading dependencies, using well-pinned versions in package.json and not pulls from github master branches, and fixing the public API endpoints problem.

Thanks, that is a good summary of the sort of maintenance work that comes with the project.

The graphoid repo itself doesn't really seem that complicated, given the only thing it seems to do is render vega data models. There is not much code in the repo either.

Please let me know if I'm misunderstanding something, but the trickier issue seems to be that vega will do data fetching for you to fully expand certain data models depending on how they are configured (1)? If this is the case then that seems to be a core feature of the service.

I am not sure I understand the question. Are you asking if being able to fetch graph/data model specs from a URL is a problem or a feature (or even both :) )?

I'm stating that graphoid via vega (1) and mw-graph-shared (2), will do fetches to fill up the vega data model when they have urls (3), which is a core feature of graphs on the wiki, and as such the server side renderer uses it.

OK, thanks for the clarification.

My question was if this is the deal-breaker with the kubernetes migration, or if it is something else (querying mediawiki with the graph id to get the full vega data model, or something else).

No, it is not a deal-breaker for the kubernetes migration per se, but it does complicate things.

  • What is the solution for a service that needs to make external requests when deployed to kubernetes?

Define external. External as in outside the service or outside the WMF infrastructure? Note that kubernetes doesn't really change much the way requests are being made, aside from a few hardening details. In kubernetes we block all outgoing requests and only allow specific destinations. If the aforementioned destinations are in the WMF infrastructure then we add them to the whitelist. Specifically for mediawiki, we want the service to use our internal discovery URLs (https://api-rw.discovery.wmnet) with the corresponding HTTP Host: header for the project being accessed, for easier operational actions (being able to depool an entire DC for a service) and to avoid polluting the edge caches by mistake. If the requests are outside of the WMF infrastructure they need to go via a proxy (also to be added to the whitelist). This is to decrease the impact of a compromised service.

I guess it would then need to be defined whether the requests to these domains are to be treated as internal or external. I believe the exact same requests can be done by JavaScript in the browsers.

From the point of view of the infrastructure, most of them are internal, but some, namely wmflabs.org, vega.github.io, wdqs-test.wmflabs.org, upload.beta.wmflabs.org, are external. That being said, requests to the internal ones are currently not done via the discovery URLs, something that should be changed.

Thanks for clarifying the discovery URL part. If those requests were to be treated as internal, it seems doable by modifying mw-graph-shared.

By the way, any kind of changes/work would probably be unrelated to kubernetes. The execution patterns (the extra hardening described above aside) are more or less the same as far as the software is concerned.

They could be, but it seems like the problems and the motivation for this are the issues uncovered by the constraints of our kubernetes deployment. So while theoretically unrelated, they seem very much related in this case.

I think that the kubernetes migration ultimately uncovered the fact that the software and its installation are unmaintained, e.g. when trying to build a docker image and failing. We did work around that, but it was a telltale sign. The issue would have arisen anyway when trying to upgrade the current hardware cluster the service resides on to a newer Debian version. A code stewardship request would eventually have been filed then as well.

Documenting the architectural issues is mostly me diving deeper, trying to figure out why I could not benchmark the software (and yes, that is related to the kubernetes migration, as we want to schedule workloads on it that have well-defined resource usage).

I personally fail to see how this would be refactored

When I alluded to this above, I personally just meant that the code itself could use a little cleaning up. There's a bunch of copy/pasted code that deals with the small differences between Vega 1 and Vega 2, and adding Vega 4 and Vega Lite on top of that would just make it worse. More importantly, the tests are too fragile and not working. Because it's not a lot of code, it seems easier to rewrite it with a more general approach to current and future versions of Vega, as well as better testing.

The graphoid service fetches the graph from the mediawiki API, essentially using mediawiki pages as a data store. However, to identify the required graph it needs to know the graph's identity, which is essentially a hash of the graph.

Perhaps this is off-topic, but this architecture always made me uncomfortable, as MW does not guarantee that the current view of a page is the same as the canonical view, but pp_value will always be for the canonical view. This can lead to subtle (and not-so-subtle) inconsistencies with graphs. The whole page props system was never meant for, nor is it appropriate for, storing data necessary to render the currently viewed page, and this whole system is an abuse of that API imo.

Note that Graphoid already supports a POST approach (see the v2 Graphoid API). You can POST a graph spec to it, and it will return an image. But v2 was never actually used by any services, even though it is in production, just not exposed to end users. The issue with v2 is that we don't want to make the MW parser (HTML rendering) dependent on an external service because of potential (unverified) speed concerns - we need to generate some graph image URL without waiting for Graphoid to actually generate that image.

So the HTML generator needs to store the graph JSON somewhere under some ID, use that ID in the HTML image URL, and later an image request needs to be routed to Graphoid to actually generate that image. As an optimization we can use cache seeding -- the MW parser can send the JSON to Graphoid as well without waiting for the response, and Graphoid/the wrapper would cache that image until requested by the browser.
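Sketching that flow in JavaScript just to keep the examples in this task in one language (in reality the parser side would be PHP, and the endpoint names below are invented):

const crypto = require('crypto');

// 1. At parse time: derive a stable ID from the graph spec.
function graphIdFor(spec) {
  return crypto.createHash('sha256').update(JSON.stringify(spec)).digest('hex');
}

// 2. Emit an <img> pointing at the renderer, keyed only by that ID.
function graphImgTag(spec) {
  return `<img src="/graphoid/png/${graphIdFor(spec)}" alt="graph">`;
}

// 3. Cache seeding: fire-and-forget the spec to the renderer so the PNG is
//    (probably) ready before any browser asks for it. Node 18+ global fetch assumed.
function seedGraphCache(spec) {
  fetch('https://graphoid.example/png/' + graphIdFor(spec), {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(spec),
  }).catch(() => {}); // errors deliberately ignored: this is only an optimization
}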

P.S. @Bawolff I might be wrong, but I think at the time of the first Graphoid version, pageprops were always in sync with the canonical view, until a bit later when some things moved to the job queue.

P.S. @Bawolff I might be wrong, but I think at the time of the first Graphoid version, pageprops were always in sync with the canonical view, until a bit later when some things moved to the job queue.

Moving linksupdate to the job queue made the problem more noticeable by introducing latency, but the problem always existed, way back to when the page_props table was introduced. Fundamentally, there is not a 1:1 relationship between the current version of the page and a view (parserOutput) of that page. For any given page, you can have hundreds of variants of the rendering of the page (depending on a number of things, like user language). You can write a wiki page that has a different graph on it depending on what the user selects in Special:Preferences (for some preferences, not all of them). But only the canonical version [basically default prefs] gets stored in the page_props table. Even worse, you can make the rendering non-deterministic so it is different every time it is rendered.

99% of the time this is unnoticeable. People doing normal things should mostly not be triggering it. But it's still going to be a source of subtle inconsistencies and not what I would describe as the most sound architecture.

Okay, this has been sitting in draft for too long, so I'm going to provide this simply so that we have it here and people have access to some queries in case they're looking around later. I was intending to share some numbers then follow up about product aspects on a separate post. But I've been rather busy with planning and anticipate continuing to be quite busy with more planning. We'll be discussing potential product / support aspects separately in a forthcoming meeting.

No doubt graphs are informative and it's nice to have rasters; it's all about maintenance cost tradeoffs as compared to, for example, off-production tools or browser-based rasterizing gadgets that could fill the same sort of need (you only need client-side JS if you take away the rasterizing problem)... or simply using tables where they'd be a good enough solution for conveying information to the reader.

Something I was wondering about: for the cases where Graph is principally being used to superimpose a red dot on a map, or otherwise superimpose other things on a map, is there a pure maps-y way to do that which can achieve a similar user benefit?

And could people live without the static renders of the pageview graphs on the talk pages? For example, would it be possible to use a JS control by default but have a table for JS incapable clients?

Okay, the original post, semi-complete:

As @akosiaris notes there's a small set of wikis with a lot of pageprops indications of graphs. But there aren't necessarily lots of pages with actual <graph> tags - that, plus the nature of the distribution of traffic for pages bearing graph output, partly explains the relatively modest web access phenomenon.

I'm less certain about the use of the tech in WDQS, but maybe someone here knows more.

Let's look at enwiki for a moment.

We can get a feel for where <graph> tags exist with a search query.

There's a grand total of 182 pages bearing <graph> in the source according to that primitive search query (that includes a low level of non-executing documentation).

113 are in the User namespace. 39 of those invocations are in the Article namespace. 8 are in the Wikipedia namespace. 5 are in the Template namespace; 4 of those 5 templates have just one article associated with them bearing the <graph> tag. The other 17 usages of the tag are scattered about.

So those are interesting for tracking down obvious examples.

The general distribution of the graph output is in pageprops like @akosiaris showed in the pastes. Here's the namespace breakdown:

mysql: [enwiki]> select count(*), page.page_namespace from page join page_props
 on page.page_id=page_props.pp_page where pp_propname='graph_specs' group by page.page_namespace;
count(*)        page_namespace
1462    0
11427   1
417     2
1038    3
137     4
90      5
4       7
102     10
23      11
44      15
30      101
3       118
9       119
4       828
2       829

The Template namespace is pretty interesting. As noted, 4 of the 5 templates bearing the <graph> string have just one article associated with them. But the remaining one of those 5 templates has 1,182 articles associated with it - it's https://en.wikipedia.org/wiki/Template:Graph:Street_map_with_marks. As an example, https://en.wikipedia.org/wiki/East_Germany uses Template:OSM Location map, which in turn uses Template:OSM Location map/core, which in turn uses Template:Graph:Street map with marks, which has a <graph> string in it along with the machinery of cartographic mapsnapshot:///, an invocation of Module:TNT, and syntax to superimpose a dot over the cartographic mapsnapshot.

There are other, more straightforward invocations that don't necessarily bear the <graph> string but instead employ the #tag:graph parser function (more examples at the #tag:graph search). Many of those start with "Template:Graph", so that got me thinking it would be interesting to see how those are laid out.

mysql: [enwiki]> select tl_title, tl_from_namespace, count(*) from templatelinks where tl_title like 'Graph:%' and tl_namespace=10 group by tl_from_namespace, tl_title;
tl_title        tl_from_namespace       count(*)
Graph:BTS_Skytrain_ridership    0       1
Graph:Chart     0       292
Graph:Chart/styles.css  0       323
Graph:Highlighted_world_map_by_country  0       2
Graph:Highlighted_world_map_by_country/Inner    0       2
Graph:Lines     0       32
Graph:Major_League_Soccer_Season_Records        0       3
Graph:Moscow_Metro_expansion    0       2
Graph:Most_Expensive_Books      0       1
Graph:Most_Expensive_Cars       0       1
Graph:Most_Expensive_Paintings  0       1
Graph:PieChart  0       3
Graph:Population_history        0       2
Graph:Seattle_Sounders_FC_(MLS)_seasons 0       2
Graph:Street_map_with_marks     0       1097
Graph:US_Map_state_highlight    0       1
Graph:Weather_monthly_history   0       1
Graph:Chart     1       20
Graph:Chart/styles.css  1       19
Graph:PageHistory       1       3
Graph:PageViews 1       11398
Graph:PieChart  1       1
Graph:Street_map_with_marks     1       6
Graph:Chart     2       94
Graph:Chart/sandbox     2       1
Graph:Chart/sandbox/styles.css  2       1
Graph:Chart/styles.css  2       54
Graph:Highlighted_world_map_by_country  2       1
Graph:Highlighted_world_map_by_country/Inner    2       1
Graph:Lines     2       5
Graph:Map_with_marks    2       1
Graph:Moscow_Metro_expansion    2       1
Graph:PageHistory       2       2
Graph:PageViews 2       129
Graph:PieChart  2       3
Graph:Population_history        2       1
Graph:Street_map_with_marks     2       41
Graph:Weather_monthly_history   2       1
Graph:Chart     3       114
Graph:Chart/styles.css  3       85
Graph:PageViews 3       91
Graph:PieChart  3       1
Graph:Street_map_with_marks     3       5
Graph:Chart     4       46
Graph:Chart/styles.css  4       40
Graph:Lines     4       1
Graph:PageViews 4       34
Graph:Street_map_with_marks     4       5
Graph:US_Map_state_highlight    4       5
Graph:Weather_monthly_history   4       3
Graph:Chart     5       11
Graph:Chart/styles.css  5       7
Graph:PageViews 5       52
Graph:Street_map_with_marks     5       2
Graph:PageViews 7       4
Graph:Chart     10      31
Graph:Chart/doc 10      3
Graph:Chart/sandbox     10      1
Graph:Chart/sandbox/styles.css  10      1
Graph:Chart/styles.css  10      30
Graph:Highlighted_world_map_by_country  10      5
Graph:Highlighted_world_map_by_country/Inner    10      6
Graph:Highlighted_world_map_by_country/Inner/Worldmap2c-json    10      1
Graph:Highlighted_world_map_by_country/doc      10      1
Graph:Lines     10      2
Graph:Lines/doc 10      1
Graph:Major_League_Soccer_Season_Records        10      1
Graph:Map_with_marks    10      2
Graph:Map_with_marks/doc        10      1
Graph:Moscow_Metro_expansion/doc        10      1
Graph:PageHistory/doc   10      1
Graph:PageViews 10      7
Graph:PageViews/doc     10      1
Graph:PieChart  10      2
Graph:PieChart/doc      10      1
Graph:Population_history        10      2
Graph:Population_history/doc    10      1
Graph:Stacked   10      2
Graph:Stacked/doc       10      1
Graph:Street_map_with_marks     10      22
Graph:Street_map_with_marks/doc 10      1
Graph:USL_Championship_clubs_timeline/doc       10      1
Graph:US_Map_state_highlight    10      2
Graph:US_Map_state_highlight/doc        10      1
Graph:Weather_monthly_history   10      2
Graph:Weather_monthly_history/doc       10      1
Graph:Chart     11      1
Graph:Chart/styles.css  11      3
Graph:Lines     11      1
Graph:PageViews 11      17
Graph:PieChart  11      2
Graph:Stacked   11      1
Graph:Street_map_with_marks     11      1
Graph:PageViews 15      44
Graph:PageViews 101     30
Graph:Street_map_with_marks     118     2
Graph:PageViews 119     9
Graph:Chart     828     4
Graph:Chart     829     1
Graph:Street_map_with_marks     829     1

You'll note that Template:Graph:PageViews and the other one I already mentioned, Template:Graph:Street_map_with_marks, are rather heavily used. Interestingly, Template:Graph:PageViews is not used in the main namespace 0. As many may know, you're more likely to see it on the talk pages of some pages, often rather popular pages.

Hi all - we had a meeting on this and a question emerged. Is it possible to run Graph without Graphoid if the clients are JavaScript-only?

Hi all - we had a meeting on this and a question emerged. Is it possible to run Graph without Graphoid if the clients are JavaScript-only?

IIRC, yes. Graphoid is supposed to be a fallback mechanism for old(er) browsers and those with JS disabled.

@mobrovac @dr0ptp4kt there is a much larger issue -- performance, and this is identical to maps: when a page loads, do you want to show a map/graph right away, e.g. if it is at the top, or do you want to load a large library, load all the data for it, and only then show the page? Or show an empty box that eventually gets loaded? A live map requires leaflet, tile images, could get data from commons, could run a sparql query (potentially overloading the server), etc. The exact same issues apply to graphs - load the Vega lib, load data, run sparql queries, etc. So if maps need it, graphs need it, and if the servers can take both - sure. I would much rather have interactive content from the start.

Thanks @mobrovac @Yurik.

Totally understood on the user-observed performance (at least when it's on the critical rendering path) and the static asset payload; I just wanted to make sure that theoretically JS-only provides all of the functionality (for JS clients) and there wasn't some additional use case.

What's the current static raster purge mechanism if the underlying data processed by the SPARQL engine has been updated? Is the query result data materialized into some sort of data structure based on some sort of taint trigger or cache expiry? What triggers the updated raster?

Maybe as a quick fix, graphs could initially render as a generic "graph loading" image. Then vega and all the dependencies needed could be loaded asynchronously and eventually render the graph. With graphs not super widely used yet, this shouldn't add too much load on our static asset serving. But it might become a problem if graphs become widely used. Hopefully a suitable solution/replacement for Graphoid would be available by then. For what it's worth, I think this is what Yuri wanted to do at the very beginning, but ops and others raised concerns that resulted in Graphoid.

Note: it would be possible to set the size of the "graph loading" image from the vega definition.

Note that current versions of Vega (since 4.3.0) need ES6 support, so we would either need to transpile Vega with Babel (slightly increasing the JS footprint) or keep Vega locked at an older version. Some back-of-the-envelope calculations:

  • Minified transpiled Vega core: 442 KB
  • Minified d3: 222 KB
  • Total extra JS load (minimum): 664 KB
Obviously, we couldn't load that for every page as it would more than double our JS load per page, and we probably wouldn't even want to top load it on pages that needed it. It seems most realistic that we would end-load it on pages that need it and show a spinner in place of the graph until then (as suggested above).
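A browser-side sketch of that idea, sizing the placeholder from the Vega definition (as suggested above) and pulling in the library only when a graph is actually on the page; the bundle URL, CSS class and data attribute are placeholders, and a current (v5) Vega build is assumed:

// Load a script on demand and resolve once it has executed.
function loadScript(src) {
  return new Promise((resolve, reject) => {
    const s = document.createElement('script');
    s.src = src;
    s.onload = resolve;
    s.onerror = reject;
    document.head.appendChild(s);
  });
}

async function hydrateGraph(container) {
  const spec = JSON.parse(container.dataset.spec);

  // Size the "graph loading" placeholder from the Vega definition to avoid reflow.
  container.style.width = (spec.width || 300) + 'px';
  container.style.height = (spec.height || 200) + 'px';
  container.classList.add('graph-loading'); // hypothetical spinner styling

  // Pull in the (large) library only now; the URL is a placeholder.
  await loadScript('https://example.org/static/vega.min.js');

  const view = new window.vega.View(window.vega.parse(spec), {
    renderer: 'canvas',
    container,
  });
  await view.runAsync();
  container.classList.remove('graph-loading');
}

document.querySelectorAll('[data-spec]').forEach(el => { hydrateGraph(el); });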

Another, not yet mentioned consideration: there was a significant syntax change between Vega 1.5, Vega 2, and 3+. The Graph extension & Graphoid support both 1.5 and 2, but the Graph extension does it in an imperfect way: the browser only loads the latest version required, so if a page has both 1.5 and 2, it will just load 2. This only affects dynamic graphs (v2 only), but it really breaks during page preview -- only v2 graphs are previewed if both are present. When used as images from Graphoid, there are no issues (regular page viewing).

Vega versions after 3 are all backward compatible, and now the team is making an extra effort to keep it that way, so in-place library upgrade after v3 is fine.

There are very few v1.5 graphs out there, so they can easily be fixed, but introducing v3+ means that both versions must be loaded on a JavaScript-only page that has both v2 and v3 graphs (both the correct Vega lib and the d3 lib). And obviously v1.5 and v2 should eventually be phased out, now that Vega has been syntax-stable for so many years and is developing very rapidly quality-, feature-, and community-wise.

@Yurik in addition to pulling both versions of the JS for pages bearing both v1.5 and v2 syntax, the other thing is to encourage the syntax being updated, is that right?

I would say so, @dr0ptp4kt, I'd maybe even go so far as to host a graph edit-a-thon to upgrade all v1.5 and v2 graphs to v4 and eliminate the problem entirely. I've just been talking with Wikimedia NYC and they wanted to experiment with a bit of tech flavor in their regular edit-a-thons, this would be a perfect start. Obviously we have to support v3 first. I would suggest the following plan:

  • implement the "graph loading" spinner, turn off graphoid
  • add support for v4 to the graph extension (but < v4.3.0 so we don't have to worry about Babel for now, that might become easier when we settle T199004)
  • convert all graphs (not hard just tedious, might take a few months of edit-a-thons and individual effort)
  • remove support for < v4
  • implement a new graphoid to get rid of the "graph loading" spinner

I think this sequence makes each phase simpler and removes the multi-version support problem, at the cost of some up-front upgrading work.

@Milimetric I think the removal of Graphoid will be far more difficult than just keeping it. If you remove it, you will have to support multiple Vega versions on the client - not as trivial to do as it is for the Graphoid service (which was built with that support).

@dr0ptp4kt - correct, you will need to pull v1.5, v2, and v3+ to the client depending on what graphs are being served, but they might not work well together. And once done, yes, all graphs should be migrated to v3+.

On an unrelated note, I find it funny that the top wiki in terms of graphs is very low in requests to graphoid. In fact, I get the impression (which could be false) that the two ordered datasets don't correlate very well with one another, but I think this is a question best left to be answered in some other task.

I can explain at least the results for Russian Wikipedia: we have opted to show Graph-based page view statistics on every action=info page, so that might drive up requests without registering in the first stat (see MediaWiki:Pageinfo-header). Example:
https://ru.wikipedia.org/wiki/Заглавная_страница?action=info

Excuse me, I'm a simple guy: I'd like to know why a direct JavaScript (vega-based) approach isn't possible? It looks like the graphs are built this way: code > javascript/vega > image built > image shown on the page, am I right? If yes, what are the reasons preventing "code > graph shown on the page"?

Hi @Bouzinac, usually the software tries to make basic web page rendering functional without a strict need for JavaScript. Obviously a significant portion of the software requires JavaScript in practice for a feature-rich experience, but basic reading is one case where we usually want to support the basics.

This may be one case where the tradeoff would be worth it, though.

The JavaScript is also relatively heavyweight, which in some cases can influence page load time and in any case uses more bandwidth. I like @Milimetric's idea of having a loading indicator in case the graph element is in the viewport, and to reduce unnecessary bytes shipped it would be possible to lazy load the JavaScript (clients without JS would probably just have to get a link or a synthetic box of the raw data or something).

At the moment, though, as @Yurik notes, if going the JS route it would probably be best to get all the graphs onto something like the latest stable version of Vega, so that the client JS can expect to process only one type of syntax; otherwise the JS loader would have to figure out how to get the correct version of the JS for the version of the graph. This particular software architecture decision is also something that probably needs to be considered down the road for upgrades. My impression is that the syntax is now pretty stable, but there's a risk that there could be backward-incompatible changes for future JS updates (like if the JS should be updated because of identified security issues, non-security issues, etc.).

At the moment, engineering resources at the Foundation are committed to other project work, so we're in a bit of a holding pattern. I'm wondering if there are volunteers who could work through reviewing (and tweaking) syntax on the graphs to get them to the latest stable version of the JS and who might be interested in working on a lazy loading approach.

At the moment, engineering resources at the Foundation are committed to other project work, so we're in a bit of a holding pattern.

Assuming no volunteers show up (I sure hope they do), do we have any estimate of how long this holding pattern is expected to last? The reason I am asking is that this code stewardship request has already been ongoing for >6 months and the hardware and the Operating System powering this service are going out of support.

Jrbranaa raised the priority of this task from Medium to High. (Jul 16 2019, 9:24 PM)

the hardware and the Operating System powering this service are going out of support.

What's the timeline for that?

Graphoid is based on NodeJS, so it should be migrated to Node 10 (and thus Stretch) either this or next quarter, see T210704.

the hardware and the Operating System powering this service are going out of support.

What's the timeline for that?

Just for completeness' sake, as the OS + nodejs version part is more important (and actually comes sooner time-wise), I'll comment on the hardware part.

Hardware is a bit more difficult to answer, as the machines powering the clusters providing the service were not all bought at the same time.

It's anywhere between yesterday and the next couple of quarters (for eqiad), and within the next year (for codfw). We are already stretching the former (eqiad) and have no plans to refresh it because of the move to kubernetes. We can handle hardware failures though, albeit at decreased capacity. The latter (codfw) is still OK, but we don't have any plans to refresh it either (for the same reason).

Hi team, just to follow up on what I've let some of you know by email: I'm going to investigate the option of finite temporary resources to help handle the more pressing EOL item and buy us more time. We'll need to scope narrowly so that we address the EOL problems at a minimum and hopefully sidestep any further challenges like this in the future.

Heads up @MarkTraceur @phuedx (CC @pmiazga) @Jhernandez @ssastry @JoeWalsh , we'll need to think on this as it has ripple effects in several areas.

We're mulling this over still.

Hello all. We're going to turn this into a client-side feature and divest of the server side rendering componentry.

@MarkTraceur will be recruiting a contractor. @Keegan, we'll want to communicate out the changes to support here. It's been a while since I've requested support directly - what's the process to coordinate? I think it would be best to work with you on this given your proximity to things multimedia.

We'll need to drop support for server-side rendering for non-JavaScript clients. For the primary article content we prefer server-side rendering, to maximize the number of clients that can consume the content, but there are other sorts of multimedia (A/V, 3D) where in practice we have a similar requirement for web browsers, and so this is another example of such a treatment.

@dr0ptp4kt and how will this work? The graphs will start out as some placeholder image as the vega module is brought in async?

@Milimetric the visual treatment depends on a few factors, although yes, I think we'll want a placeholder to account for any deferred / lazy fetch or client compute timing things.

Hello all. We're going to turn this into a client-side feature and divest of the server side rendering componentry.

We'll need to drop support for server-side rendering for non-JavaScript clients.

Most graphs, if I understand correctly, are static, or sufficiently near-static that they could work as normal SVGs without any JS. (See also T96309.) The library is quite large, and usually completely unneeded.

For the primary article content we prefer server-side rendering, to maximize the number of clients that can consume the content, but there are other sorts of multimedia (A/V, 3D) where in practice we have a similar requirement for web browsers, and so this is another example of such a treatment.

I don't think that's a reasonable comparison. Outside certain exceptions, we're talking about simple images, not multimedia.

@Yair_rand agreed, the relatively larger JS component size is not ideal. We'll need to compensate with deferred/lazy loading. As for the comparison of the different media types, also agreed on the notion that there's a qualitative difference between static rasters of graphs that wouldn't have follow-on interactivity and plays of video/audio - this is more to note that JS as a requirement for some non-wikitext content is somewhat normal. Video and 3D have a raster, too; it's just that the content is intended to be played to enrich the experience. It's certainly a tradeoff between maintenance cost and the benefits of built-in static rendering.

@dr0ptp4kt not just JS -- data sources could be a far larger component of the graphs - e.g. one graph could mix together multiple data sources, including some tabular data pages (up to 2 MB each), queries to Wikidata (currently broken btw -- lots of users are complaining because millions of population graphs are broken), a few images from commons, and even some mediawiki API calls. A full download could be tens of megabytes, and some could be slow.

@dr0ptp4kt thanks for the mention, sorry for the delay. I'm at WikidataCon this weekend, I'll get back to you about filing the support request early next week.

@Yurik good point. I think the guidance for use of the functionality will necessarily need to note such complications. I imagine instrumentation would be the fastest path to flagging particularly slow cases for remediation.

I believe that's distinct from the issue in T226250: Graph not displayed if linked to a wikidata query, but for the specific case of slow SPARQL query results, independent of payload size, it seems like some sort of pre-emptive caching of the materialized query result is going to need to be employed. I gather, and seem to recall, that this was more or less solved or compensated for in the server-side rendering approach (sans the apparent cross-domain issue that's surfaced), but with the JS approach it seems like the architectural pieces (browser JS, MW app servers, edge cache, SPARQL) for the fetching-and-waiting orchestration may need to be re-organized.

@Yair_rand agreed, the relatively larger JS component size is not ideal. We'll need to compensate with deferred/lazy loading. As for the comparison of the different media types, also agreed on the notion that there's a qualitative difference between static rasters of graphs that wouldn't have follow-on interactivity and plays of video/audio - this is more to note that JS as a requirement for some non-wikitext content is somewhat normal. Video and 3D have a raster, too; it's just that the content is intended to be played to enrich the experience. It's certainly a tradeoff between maintenance cost and the benefits of built-in static rendering.

Isn't "deferred lazy loading" primarily what the server-side component is for? I thought non-js users was a minor benefit, the primary use case being we want to lazy load graphs because they can have a very large footprint in terms of component sizes.

I guess it's not clear to me what exactly the decision to kill the server-side component means. Are we replacing it with a different server-side component? If so, how do we know the new component won't suffer from the same concerns that Graphoid has?

I guess it's not clear to me what exactly the decision to kill the server-side component means. Are we replacing it with a different server-side component? If so, how do we know the new component won't suffer from the same concerns that Graphoid has?

No, the idea is that we will no longer have any server-side component dependency. This decision is based on the fact that we don't have engineering capacity to support a dedicated server-side component. It's obviously not ideal, and may have various problems (as mentioned by @Yurik and @Yair_rand), but the alternative would be to retire the Graph extension entirely, which would be worse.

Someone must have fixed it? I just noticed that it works at the moment!

Hm, I see that it's not working on every article. Hope someone is working on it now.

I guess it's not clear to me what exactly the decision to kill the server-side component means. Are we replacing it with a different server-side component? If so, how do we know the new component won't suffer from the same concerns that Graphoid has?

No, the idea is that we will no longer have any server-side component dependency. This decision is based on the fact that we don't have engineering capacity to support a dedicated server-side component. It's obviously not ideal, and may have various problems (as mentioned by @Yurik and @Yair_rand), but the alternative would be to retire the Graph extension entirely, which would be worse.

Honestly, I still have a lot of reservations about this decision. Graphoid is a dead simple service to write. There were some basic mistakes made in the first implementation that could be fixed by anyone with JS experience (like whoever is hired to the contractor position), and supporting a well-functioning Graphoid will be simpler than supporting a graph extension that comes with a lot of caveats about what data it can pull and how big it can be and all that. Plus, it really makes no sense to load graphs on-demand. People would probably switch to static images over a random placeholder that somehow figures out you're looking at it and lazy-renders. The design for that interaction alone would probably take longer than just fixing graphoid.

My main point is, take a look at what Graphoid needs to do. I'm not being "weird optimistic Dan" here, it really is just the three steps below (sketched in code right after the list):

  1. configure Vega
  2. get graph definition
  3. call render
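For illustration, those three steps with current Vega's Node API (a minimal sketch, assuming the optional canvas package is installed; how the spec arrives -- POSTed directly or fetched by ID -- is deliberately left out):

const vega = require('vega');
const fs = require('fs');

async function renderGraph(spec) {
  // step 2, "get graph definition", is assumed done: `spec` was handed to us.

  // step 1, "configure Vega": compile the spec into a dataflow with a headless renderer.
  const view = new vega.View(vega.parse(spec), { renderer: 'none' });

  // step 3, "call render": rasterize to a node-canvas and return the PNG bytes.
  const canvas = await view.toCanvas();
  return canvas.toBuffer('image/png');
}

// e.g. renderGraph(JSON.parse(fs.readFileSync('graph.json', 'utf8')))
//        .then(png => fs.writeFileSync('graph.png', png));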

Recent events have compelled me to adopt graphoid. I will personally maintain it until we find a rightful owner. I will be carefully re-reading this and other conversations about it, implementing it as properly as I can. Do let me know if anyone here has new thoughts about best practices, I'm starting with a clean slate from the latest service-template-node.

@Milimetric - If you are working on fixing the problems with Graphoid, please create a Phabricator task to track this work.

Recent events have compelled me to adopt graphoid. I will personally maintain it until we find a rightful owner. I will be carefully re-reading this and other conversations about it, implementing it as properly as I can. Do let me know if anyone here has new thoughts about best practices, I'm starting with a clean slate from the latest service-template-node.

Hi @Milimetric, great that you are volunteering for this. From what I gather, you will not be adopting graphoid in its current state, but rather rewriting it from scratch. For what it's worth, I personally think this endeavor is much more likely to succeed than trying to fix the current software. With that in mind, there are probably a couple of things that multiple people need to have a shared understanding about before we even move to the technical side of this. I am listing them below; I expect @dr0ptp4kt, @MarkTraceur and @Jseddon will want to add to the list.

  • There will be a new piece of software/service, written from scratch, that provides at least a subset of the functionality that graphoid did provide. What that subset is, and in what way it is provided, is fully up to @Milimetric and, I guess, the maintainers of the Graph extension (who are they btw?), as that software is one of the primary cooperating components.
  • As a consequence of the above, Graphoid, as in the currently deployed software (https://gerrit.wikimedia.org/g/mediawiki/services/graphoid) remains unadopted.
  • We probably want the new software/service to not have the exact same name, mostly for disambiguation reasons. Calling the new thing graphoid would create expectations about the functionality it provides that would only serve to cause confusion.
  • Graphoid could still be undeployed. The ramifications of that have already been explored by @dr0ptp4kt and @MarkTraceur and hence a plan already exists and code for it is (being) written.
  • Debian Jessie's removal from the fleet is due on March 31st (the day I write this). Unfortunately we aren't going to make that, but with the state of the world in this COVID-19-dominated present, that's fine. We can delay it, but it will inevitably happen, as that infrastructure is dated and unsupported.
  • There is no clear timeline for when the new service will be ready for deployment.

Given all the above, my suggestion would be to decouple the creation of the new software (<insert name here>) from the plans to undeploy the current Graphoid. It may feel like a waste, but it allows teams to proceed on their own terms and schedules without blocking each other, imposing deadlines or increasing risks for the infrastructure. Whatever (subset of) functionality of Graphoid is provided by the new software can be adopted by the various components (e.g. REST API, Graph extension) on their own schedules. The obvious downside is that there may be a period of time with reduced functionality/performance for end-users, but to me this approach has a higher chance of success long-term and tries to minimize that period. Hence it's preferable.

The technical side of things should probably go into another task, but I'd first like to make sure we are all in agreement with the above.

@Milimetric, @dr0ptp4kt, @MarkTraceur, thoughts on the above?

For reference, all of the current discussion seems to be taking place in T249419: RFC: Render data visualizations on the server.

I'd like to propose that we close this ticket, since we've decided we are no longer going to be using Graphoid in its current implementation (and have already started undeploying it). Once the RFC at T249419 concludes, I imagine we would want to create some fresh tickets for the work involved, rather than recycling this one. Any objections?

So who has taken stewardship of the service for its remaining operation, for the work done/being done to enable its client-side component, and for undeploying and decommissioning the service and its MW integration?

The undeployment/client-side enabling is being handled by @Jseddon.

For supporting Graphoid's final days and decommissioning the service after its full undeployment, I'm assuming Service Operations would handle that. Is that a good assumption, @akosiaris?

I'd like to propose that we close this ticket, since we've decided we are no longer going to be using Graphoid in its current implementation (and have already started undeploying it). Once the RFC at T249419 concludes, I imagine we would want to create some fresh tickets for the work involved, rather than recycling this one. Any objections?

+1

The undeployment/client-side enabling is being handled by @Jseddon.

For supporting Graphoid's final days and decommissioning the service after its full undeployment, I'm assuming Service Operations would handle that. Is that a good assumption, @akosiaris?

Yes, that's correct. Once the phased per-wiki undeployment is done (tracked in T242855), we'll take over the remainder of the actions in that task and remove the service itself from the infrastructure.

(This is intended to be a response to T334940#9424739, but posting it as this task has more context)

Note (and a warning to potential code reusers): the last (pre-archival) version of graphoid is prone to RCE due to CVE-2020-26296. For a PoC, see https://github.com/vega/vega/issues/3018#issuecomment-748929438

In addition, there are 0-days in the current versions of Vega and Vega-Lite (see T336595#8888955).