During the migration of the service to the Kubernetes-based deployment pipeline, a number of issues became evident that hindered and effectively paused the migration.
The issues identified were:
An unorthodox architecture of the service's API.
See https://www.mediawiki.org/api/rest_v1/#!/Page_content/get_page_graph_png_title_revision_graph_id for the specification.
The graphoid service fetches the graph from the mediawiki API, essentially using mediawiki pages as a data store. However, to fetch the required graph it must know the graph's identity, which is essentially a hash of the graph itself.
However, that raises a question. If the caller of the graphoid service API already knows the hash of the graph, they either got it by talking to mediawiki, or they calculated it themselves (i.e. they already have the entire graph). If they already have the entire graph, why not POST it to the service and obtain the resulting PNG directly? Things become even more convoluted when the caller is mediawiki itself (to my knowledge that is actually the case), in which case mediawiki instructing another service to create requests back to it (to the API endpoint, specifically) causes a cascade of requests hitting the mediawiki API. Up to now this hasn't caused an outage, in my opinion simply because of the low traffic graphoid receives.
This tight coupling of the two components (graphoid + mediawiki) makes benchmarking the service unnecessarily risky and difficult. Having to obtain a number of titles, revisions and hashes beforehand is an unnecessary nuisance. Add in the risk that benchmarking graphoid might cause undue load on the mediawiki API, and the operating parameters of the service become essentially unknown, meaning that any support for it by anyone can only be best-effort.
Connections to the public API endpoints
Graphoid does not use our discovery endpoints (e.g. api-rw.discovery.wmnet) but rather talks directly to the public LVS endpoints (e.g. en.wikipedia.org). That has a number of consequences:
- It makes it difficult to point graphoid to another mediawiki cluster/DC/availability zone, complicating operations.
- Potentially populates the public edge caches with content that may or may not be requested
- Is prone to cache invalidation issues.
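The usual pattern for internal services is to send requests to the discovery endpoint while carrying the target wiki in the Host header. A minimal sketch of what that would look like (the endpoint name comes from the text above; the path and wiki are illustrative):

```javascript
// Sketch: route requests via the internal discovery endpoint instead of the
// public edge, keeping the wiki being addressed in the Host header.
function buildRequestOptions(wiki, path) {
  return {
    host: 'api-rw.discovery.wmnet', // internal discovery endpoint
    path,
    headers: { Host: wiki },        // e.g. 'en.wikipedia.org'
  };
}

const opts = buildRequestOptions(
  'en.wikipedia.org',
  '/w/api.php?action=query&format=json'
);
console.log(opts.host, opts.headers.Host);
```

With this shape, repointing the service at another cluster or DC is a one-line config change, and no public edge caches are touched.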
Unconventional protocol schemes
Another result of needing to talk to the mediawiki API, while also supporting external data in graphs, is an unwieldy number of network protocol schemes. For reasons yet unknown to me, HTTPS URLs are (were?) not supported in trusted graphs by vega, nor were protocol-relative URLs (e.g. //url_to_whatever). Protocol-relative URLs have been considered an anti-pattern since 2014 anyway, but for some reason support for them was implemented.
An example can be seen at https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/graphoid/+/refs/heads/master/config.dev.yaml#72 where 2 extra protocols are defined (wikiuploadraw and wikidatasparql). The only documentation for these is at https://www.mediawiki.org/wiki/Extension:Graph#External_data, which does not explain what they are or why they exist. Looking at the code reveals that a whole set of additional schemes were originally conceived but are not documented (and probably not used).
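Conceptually, such custom schemes are just a rewrite table applied before fetching. A hypothetical sketch of that mechanism (the scheme names come from the config linked above; the target hosts are my assumption, and the real mapping lives in graphoid's config and the Graph extension):

```javascript
// Sketch: map custom protocol schemes to plain HTTPS base URLs.
// Target hosts are assumptions, not graphoid's actual configuration.
const schemeMap = {
  wikiuploadraw: 'https://upload.wikimedia.org',
  wikidatasparql: 'https://query.wikidata.org',
};

// Rewrite a custom-scheme URL into a plain HTTPS URL before fetching.
function resolveScheme(url) {
  const m = /^([a-z]+):\/\/(.*)$/.exec(url);
  if (!m || !schemeMap[m[1]]) {
    throw new Error(`unsupported scheme in ${url}`);
  }
  return `${schemeMap[m[1]]}/${m[2]}`;
}

console.log(resolveScheme('wikiuploadraw://some/path/Example.png'));
```

Every scheme added to such a table is another undocumented contract between graphoid, vega and the wikis, which is exactly why the proliferation is a problem.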
Dependencies are not versioned
Not all dependencies in package.json are version-pinned. This, combined with the fact that the vega library had undergone significant changes, meant that the first efforts to add graphoid to the service pipeline, in May 2018, failed. In the end the npm shrinkwrap approach was followed in order to succeed, but it's not exactly great.
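For illustration only (this is not graphoid's actual manifest), the difference between a floating and a pinned dependency in package.json looks like this; with a caret range, `npm install` on a fresh checkout may silently pull a newer minor release:

```json
{
  "dependencies": {
    "vega": "^1.5.3",
    "some-plugin": "1.2.0"
  }
}
```

`"^1.5.3"` allows any 1.x version >= 1.5.3, while `"1.2.0"` always resolves to exactly that release. npm shrinkwrap freezes the whole resolved tree after the fact, which works, but pinning (or a lockfile) in the first place avoids the surprise entirely.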
- Current maintainer
Unclear. @Yurik is the original author and was the maintainer in the past.
- Number, severity and age of known and confirmed security issues
- Was it a cause of production outages or incidents? List them.
No, but with such a low rate of requests that is to be expected.
- Does it have sufficient hardware resources for now and the near future (to take into account expected usage growth)?
No. The scb cluster where graphoid currently resides is already deprecated and slated to be decommissioned. Services on that cluster are meant to be deployed on the deployment pipeline and kubernetes.
- Is it a frequent cause of monitoring alerts that need action, and are they addressed timely and appropriately?
- When it was first deployed to Wikimedia production
- Usage statistics based on audiences served
https://grafana.wikimedia.org/d/000000068/restbase?panelId=15&fullscreen&orgId=1 points to < 1 rps; https://grafana.wikimedia.org/d/000000021/service-graphoid?panelId=11&fullscreen&orgId=1 points to more, ~50 rps (excluding monitoring).
- Changes commited in the last 1, 3, 6 and 12 months
1 month => 1
3 months => 3
6 months => 3
12 months => 6
- Reliance on outdated platforms (e.g. operating systems)
The service relies on old versions of the vega library, namely vega 1 (1.5.3) and vega 2 (2.6.4), as well as a few vega plugins. Vega 3 is out and is being evaluated in T172938, which hasn't seen any activity in almost a year.
- Number of developers who committed code in the last 1, 3, 6, and 12 months
Two: me (@akosiaris) and @mobrovac, for the process of migrating the service to the deployment pipeline. No functionality was added or removed, no bugs were fixed, nor anything else.
- Number and age of open patches
1, and it's a deployment pipeline change.
- Number and age of open bugs
- Number of known dependencies
- Is there a replacement/alternative for the feature? Is there a plan for replacement?
- Submitter's recommendation (what do you propose be done?)
My personal take would be to investigate whether the service still serves a valuable purpose and, based on the numbers, decide whether to remove it from the infrastructure or not. If it has served its function, that's great. If it's still useful and we do decide to keep it around, a significant amount of work will have to be done to address the issues I've highlighted above. Whether that is worth it or not is a good question.