
Request creation of OurWorldinData VPS project
Closed, Resolved · Public

Description

Project Name: OWIDM

Wikitech Usernames of requestors: Doc James, TimMoody

Purpose: The goal is to create a mirror of OurWorldinData

Brief description: Our hope is to create a mirror of OWID that we can then adjust and use within Wikipedia articles, similar to how we are using them on MDWiki: https://mdwiki.org/wiki/User:Doc_James/OurWorld

Will need g3.cores4.ram8.disk20 and 200G of Cinder disk

How soon you are hoping this can be fulfilled: As soon as possible

Event Timeline

Actually, a couple of follow-ups:

  • When you say 'mirror' do you mean a literal mirror, with the exact same content? If so, why is the mirror needed beyond the primary service?
  • Will this project be serving live content to the wikis? If so there might be bandwidth/reliability things we should discuss.

No, I mean our own version of their site, similar to how we have our own version of OpenStreetMap... This will allow us to make cosmetic changes to their material, to translate material into other languages, and to propose this as a possibility to various communities that may be interested in using it.

I'm still a little confused about the intent after looking at the mdwiki example. I see template examples that embed content from ourworldindata.org. Is the intent to change the source to be this project, such that the template becomes something like:

<!-- Shows online and is interactive -->
<templatestyles src="Owid/styles.css"/><ourworldindata>https://ourworldindata.wmcloud.org/grapher/share-with-schizophrenia </ourworldindata>
<!-- Shows offline and is still -->
{{onlyoffline|[[File:Share-with-schizophrenia.svg|thumb|400px]]}}

For example, the version we have in this infobox has too much text, which makes the map too small. Having our own version will allow us to trim this so it formats better. https://mdwiki.org/wiki/Neglected_tropical_diseases

@nskaggs Yes, that is correct, so that the material comes from wmcloud. This will allow us to 1) customize these and 2) ensure privacy for our readers. Both of these will be required before a proposal can be made to use this on Wikipedia.

Ok, and for clarity, the intended use would be solely on mdwiki.org?

So the initial use, as a proof of concept, is to be on MDWiki. Community consensus will need to be developed successfully before it can be rolled out on other wikis. Basically the workflow is:

  1. Get a version of OWID running
  2. Adjust the formatting and get it displaying properly on MDWiki (plus stills displaying offline in ZIMs)
  3. Develop community consensus for use on a version of Wikipedia
  4. Bring such a request to the WMF tech team to permit this

I have been working on trying to get Our World in Data working on Wikipedia for more than 6 years, and I fully understand that there are potentially years of work ahead and always a good chance the movement will not be interested... https://en.wikipedia.org/wiki/Template:Global_Heat_Maps_by_Year

@Doc_James , Thank you very much for clarifying things! The WMCS team reviews requests every week, and we'll discuss this request in more detail tomorrow.

This request has been approved in the 9 Feb 2022 meeting. +1 to creating the project. Let us know if we can be of any further assistance.

Project created.

openstack project create --description 'owidm' owidm --domain default
openstack role add --project owidm --user docjames projectadmin
openstack role add --project owidm --user timmoody projectadmin
openstack role add --project owidm --user timmoody user
openstack role add --project owidm --user docjames user
openstack quota set --gigabytes 220 owidm

So the initial use, as a proof of concept, is to be on MDWiki. Community consensus will need to be developed successfully before it can be rolled out on other wikis. Basically the workflow is:

  1. Get a version of OWID running
  2. Adjust the formatting and get it displaying properly on MDWiki (plus stills displaying offline in ZIMs)
  3. Develop community consensus for use on a version of Wikipedia
  4. Bring such a request to the WMF tech team to permit this

I have been working on trying to get Our World in Data working on Wikipedia for more than 6 years, and I fully understand that there are potentially years of work ahead and always a good chance the movement will not be interested... https://en.wikipedia.org/wiki/Template:Global_Heat_Maps_by_Year

https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/thread/JFDYY5EIWNACATLHRWHUIXYSDCTS5LVE/ jumps right into the discussion with the community that is step 3 here. I'm not sure that discussion has been framed in a way that the community would be expected to understand that the MediaWiki extension code has not been reviewed for security or performance, or that the backing data service is supported only by two already oversubscribed volunteers.

Well, we have a version of OWID running. We have adjusted the formatting so that it mostly displays properly (plus works in offline ZIMs). And now we are working on community consensus. Community consensus will take at least a month.

The security and performance review is step 4. No sense wasting limited WMF technical staff time if no community is interested in this material. The fourth step could definitely take a year or more.

I'm surprised to see an extension mentioned for the first time. See https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment for extension requirements.

@Aklapper So the first step is to determine whether anyone wants the extension and how much doing this work will cost Wiki Project Med. Our board will then look at if and when we can afford the work.

The extension mentioned is not really part of this project. It does not run on any of this project's resources, nor is there any plan for it to do so. The security and performance review would be of more interest to the organization on whose infrastructure it gets implemented. If the extension is of broader interest, a separate project should be created for that.

I'm not certain the WMF Security-Team would take issue with the wmcs piece of this project or even the extension, certainly not within the context of non-Wikimedia-production MediaWiki installations. For Wikimedia production though, this would likely be reviewed and rated as a critical risk, which would require c-level risk acceptance for deployment. Such a risk rating would be due to the primary application living within wmcs and the intention of allowing untrusted and unsanitized OWID data to be displayed on Wikimedia production projects via MediaWiki templates.


This should absolutely be integrated into WM projects. Extremely valuable information, granularly sourced data from a trustworthy secondary source which we must find better ways to synchronize with. (The number of editor-hours currently spent updating COVID data points by hand is unnecessary.)

@sbassett a straightforward solution might be to have the OWID data integrated into our data-stores (however we choose to handle tabular data: on WD or on Commons), so that it has the same trust/sanity profile as other data. We are talking about a limited number of datasets, well within the volume of existing updates.

@Sj That was my prior 5-year effort to get OWID working within WP.

Tabular data is here https://commons.wikimedia.org/wiki/Data:Obesity_Males.tab

One can see a graph that it creates here https://en.wikipedia.org/wiki/Obesity#Epidemiology

But this is not nearly as nice as what we are generating here https://mdwiki.org/wiki/Obesity#Epidemiology

If you look at the history tab on Commons, no one is actually updating these because it is a multistep, finicky process that requires a bunch of find-and-replaces in a Google Doc.

And more than half the time the tool breaks down completely.

https://en.wikipedia.org/wiki/User:Doc_James/OurWorld#COVID19
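As a rough sketch of what automating that update could look like, the Python below converts an OWID-style CSV export into the JSON structure used by Commons Data: (.tab) pages. The file name, type guessing, and exact schema fields are illustrative assumptions, not the pipeline currently in use.

# Hypothetical sketch: turn an OWID-style CSV export into the JSON used by
# Commons "Data:" tabular (.tab) pages. Field names and paths are assumptions.
import csv
import json

def csv_to_commons_tab(csv_path, description, license_code="CC0-1.0"):
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    def col_type(value):
        # Crude type guess: number if it parses as a float, else string.
        try:
            float(value)
            return "number"
        except ValueError:
            return "string"

    fields = [
        {"name": name, "type": col_type(rows[0][i]) if rows else "string"}
        for i, name in enumerate(header)
    ]
    data = [
        [float(v) if f["type"] == "number" else v for v, f in zip(row, fields)]
        for row in rows
    ]
    return json.dumps(
        {
            "license": license_code,
            "description": {"en": description},
            "schema": {"fields": fields},
            "data": data,
        },
        indent=2,
    )

# Hypothetical usage:
# print(csv_to_commons_tab("obesity-males.csv", "Share of adult males defined as obese (OWID)"))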

In T301044#7751255, @Sj wrote:

@sbassett a straightforward solution might be to have the OWID data integrated into our data-stores (however we choose to handle tabular data: on WD or on Commons), so that it has the same trust/sanity profile as other data. We are talking about a limited number of datasets, well within the volume of existing updates.

There are likely several ways to mitigate these issues and reduce the overall risk. They would probably include:

  1. Making the existing wmcs app a proper Wikimedia service, hosted within Wikimedia production.
  2. Building out some kind of automated, data-sanitization layer.
  3. Building functionality to avoid the direct usage of externally-hosted resources (images, timed media, js, css, etc.)
  4. Building a content moderation process, either through the service layer or the extension.

With respect to moving this to a Wikimedia service... What would be the steps for that? Reach out to the team itself to discuss the possibility of this becoming a quarterly goal at some point?

With respect to moving this to a Wikimedia service... What would be the steps for that? Reach out to the team itself to discuss the possibility of this becoming a quarterly goal at some point?

I think it'd really be more about finding the resources to build this. There isn't really a "services team" per se (despite what that mediawiki.org documentation might imply), as anyone can build a production Wikimedia service and attempt to have it deployed. Many teams at the WMF have built such services, and there might be a product team that would be interested in supporting such an effort. And a number of service repos could serve as guidance for building a new service: P21855.

There are likely several ways to mitigate these issues and reduce the overall risk. They would probably include:

Making the existing wmcs app a proper Wikimedia service, hosted within Wikimedia production.

This would be ideal, but keep in mind that the rendered graphs use JavaScript and SVG, so this is not your typical PHP app.

Building out some kind of automated, data-sanitization layer.

owidm is not a straight copy, but is generated with python and bash scripts that remove all known privacy leaks and make the pages comply with licensing. They also reformat graphs to suit inclusion in a wiki page.

Building functionality to avoid the direct usage of externally-hosted resources (images, timed media, js, css, etc.)

This has been done in terms of all known externally-hosted resources.

Building a content moderation process, either through the service layer or the extension.

The conversion of the upstream site has been automated, but the generated pages are manually moderated so that any new privacy or other issues from upstream can be remediated before deployment.

The entire deployment process is a work in progress, but the plan is to remediate problems as they occur.
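To make the sanitization step described above concrete, here is a minimal Python sketch of one such pass; the tracker patterns and the allowed host are assumptions for illustration, not the actual scripts used by the mirror.

# Illustrative sketch of a sanitization pass over a mirrored OWID page.
# The script patterns and allowed host below are assumptions; the real
# python/bash scripts may differ.
import re

ALLOWED_HOST = "ourworldindata.wmcloud.org"  # assumed mirror host

def sanitize_html(html: str) -> str:
    # Drop analytics/tracking script tags.
    html = re.sub(
        r"<script[^>]*(?:googletagmanager|google-analytics|gtag)[^>]*>.*?</script>",
        "", html, flags=re.DOTALL | re.IGNORECASE)

    # Blank out src/href attributes that point anywhere other than the mirror
    # itself, so the page cannot pull in externally hosted resources.
    def rewrite(match):
        attr, url = match.group(1), match.group(2)
        if url.startswith("/") or ALLOWED_HOST in url:
            return match.group(0)
        return attr + '""'

    return re.sub(r'((?:src|href)=)"([^"]+)"', rewrite, html)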

This would be ideal, but keep in mind that the rendered graphs use JavaScript and SVG, so this is not your typical PHP app.

Wikimedia production supports a number of node/js (some examples here) and golang microservices these days. I believe this is the preferred option of SRE et al.

owidm is not a straight copy, but is generated with python and bash scripts that remove all known privacy leaks and make the pages comply with licensing. They also reformat graphs to suit inclusion in a wiki page.

Is there a public repo for this code that could be reviewed? I'm not certain if this is https://github.com/owid/owid-static, but that repo appears to not exist or be private.

Wikimedia production supports a number of node/js (some examples here) and golang microservices these days

I didn't mean to imply that it is a nodejs app. Once converted it is straight up html, but most of the work is done by runtime javascript on the front end that draws the graphs.

is there a public repo for this code that could be reviewed?

https://github.com/wpmed/owid-static-mirror. Please bear in mind that I said it is a work in progress.

Wikimedia production supports a number of node/js (some examples here) and golang microservices these days

I didn't mean to imply that it is a nodejs app. Once converted it is straight up html, but most of the work is done by runtime javascript on the front end that draws the graphs.

Right, I meant to imply that a nodejs or golang service might need to be built as a wrapper to either serve that content or make it accessible via some api layer.

is there a public repo for this code that could be reviewed?

https://github.com/wpmed/owid-static-mirror. Please bear in mind that I said it is a work in progress.

Thanks.

In case it has gotten lost in the shuffle, the best way to see how this works on mdwiki and the proposed use on other wikis is https://mdwiki.org/wiki/WikiProjectMed:OWID

An editor who wishes to embed a graph on a wiki page navigates to the graph on owidm, clicks a button that copies the extension syntax onto the clipboard, and pastes that onto the target wiki page.

The interactive graph will then get embedded via an iframe onto the target page. The extension manages the iframe by sanitizing the source url and parameters, and the owidm batch sanitizes the page that will be in the iframe.
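For illustration, here is a minimal Python sketch of the kind of source-URL check described above (single allowed host, a fixed /grapher/ path, a conservative character set); the host and rules are assumptions drawn from the examples in this task, not the extension's actual code.

# Sketch of the iframe source-URL constraint: only /grapher/ pages on the
# mirror host, with a restricted slug alphabet. Host and rules are assumed.
import re
from urllib.parse import urlparse

ALLOWED_HOST = "ourworldindata.wmcloud.org"
SLUG_RE = re.compile(r"^[A-Za-z0-9_-]+$")

def is_allowed_graph_url(url: str) -> bool:
    parts = urlparse(url.strip())
    if parts.scheme != "https" or parts.netloc != ALLOWED_HOST:
        return False
    if parts.query or parts.fragment:
        return False  # keep it simple: no extra parameters in this sketch
    prefix, _, slug = parts.path.partition("/grapher/")
    return prefix == "" and SLUG_RE.match(slug) is not None

# is_allowed_graph_url("https://ourworldindata.wmcloud.org/grapher/share-with-schizophrenia") -> True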

Sure, it's just that, as also mentioned within the security review (T324989), we can't iframe external content within Wikimedia production. And in this case, anything under wmcs or toolforge would be considered external content. So that would be a blocker to getting OWID into Wikimedia production. Hence the suggestion of developing a microservice to deliver any relevant OWID content which could live within Wikimedia production.

Okay so what would be the steps to "develop a microservice to deliver any relevant OWID content which could live within Wikimedia production"? Best

James

anything under wmcs or toolforge would be considered external content.

Well, we undertook to build owidm rather than just using owid with the understanding that this was not the case.

All of the risks identified in the security review boil down to whether the maintainers can be trusted and whether the host is secure. Does a microservice mitigate those risks?

The privacy risks identified were leakage of IP address and user agent. If the microservice is a proxy that makes a request to owidm with its own address and a hardcoded user agent string, does that mitigate the risks?

Okay so what would be the steps to "develop a microservice to deliver any relevant OWID content which could live within Wikimedia production"? Best

Likely to have an experienced developer review our current nodejs service documentation and begin specifying and prototyping a new project. But there would also be questions around additional code stewardship and sponsorship as mentioned within the security review: T324989#8670328.

anything under wmcs or toolforge would be considered external content.

Well, we undertook to build owidm rather than just using owid with the understanding that this was not the case.

All of the risks identified in the security review boil down to whether the maintainers can be trusted and whether the host is secure. Does a microservice mitigate those risks?

The privacy risks identified were leakage of IP address and user agent. If the microservice is a proxy that makes a request to owidm with its own address and a hardcoded user agent string, does that mitigate the risks?

Unfortunately it very much is the case, as wmcs and toolforge do not satisfy certain privacy and reliability concerns for direct usage within Wikimedia production. A Wikimedia production service which proxies, sanitizes, and makes highly available OWID data for consumption would mitigate many of these risks.
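As a rough illustration of the proxy idea being discussed (requests leave with the service's own address and a fixed User-Agent, so no reader IP or UA reaches the backend), here is a sketch in Python; the real service would more likely be nodejs or golang, and the host names and framework choice here are assumptions.

# Illustrative privacy-preserving proxy sketch: forwards /grapher/ requests to
# the backend with a hardcoded User-Agent and no client-identifying headers.
# Hosts and the use of Flask are assumptions for illustration only.
from urllib.request import Request, urlopen
from flask import Flask, Response, abort

app = Flask(__name__)
BACKEND = "https://ourworldindata.wmcloud.org"  # assumed upstream mirror
FIXED_UA = "owid-proxy-sketch/0.1"

@app.route("/grapher/<slug>")
def grapher(slug):
    # Reject anything outside a conservative slug alphabet.
    if not slug.replace("-", "").replace("_", "").isalnum():
        abort(400)
    req = Request(f"{BACKEND}/grapher/{slug}", headers={"User-Agent": FIXED_UA})
    with urlopen(req, timeout=10) as upstream:
        body = upstream.read()
        content_type = upstream.headers.get("Content-Type", "text/html")
    return Response(body, content_type=content_type)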

By "experienced developer review our current nodejs service documentation and begin specifying and prototyping a new project" do you mean someone within the WMF technical team? Do volunteers have access to WMF production servers or only staff?

James

Well, we undertook to build owidm rather than just using owid with the understanding that this was not the case.

This whole situation is utterly bizarre. I don't understand how this project got so far with nobody telling you that. It's not a new or secret policy that production isn't allowed to depend on labs or that labs & production have incompatible privacy requirements. Some of that is on you for building something before figuring out what the requirements are (a surefire way to make a software project fail), but even so, this is such a blatant and obvious requirements mismatch that someone should have noticed and told you proactively. I do not understand how this got so far.

Well, we undertook to build owidm rather than just using owid with the understanding that this was not the case.

This whole situation is utterly bizarre. I don't understand how this project got so far with nobody telling you that. It's not a new or secret policy that production isn't allowed to depend on labs or that labs & production have incompatible privacy requirements. Some of that is on you for building something before figuring out what the requirements are (a surefire way to make a software project fail), but even so, this is such a blatant and obvious requirements mismatch that someone should have noticed and told you proactively. I do not understand how this got so far.

T301044#7748951 does say exactly that, though, and it was posted very soon after the MW extension plans were brought up in this task.

What was built was a proof of concept and is in active use by MDWiki. Yes, we are learning more about WMF requirements as we go along.

By "experienced developer review our current nodejs service documentation and begin specifying and prototyping a new project" do you mean someone within the WMF technical team?

Possibly, though several volunteers should be capable of building such a service as well. I would note, though, that some form of WMF sponsorship around maintenance, etc. from a Product or Tech team would likely be necessary to eventually land this system into Wikimedia production.

Do volunteers have access to WMF production servers or only staff?

A few highly-trusted volunteers do, yes. Though that is not an easy process for any volunteer to walk through.

A proxy service for calls to such caches of trusted + curated knowledge makes sense.
That doesn't seem specific to this OWID use case; is there an existing proxy library that could be generalized or extended?

Well, we undertook to build owidm rather than just using owid with the understanding that this was not the case.

It seems there are at least four levels of internality: on the cursed web, on blessed partner sites, on some WM system (like wmcs; can be useful for caches as in T186293), and in production. iframe on a wiki --> thin production-service proxy --> more complex system on wmcs --> updated via APIs on a blessed OWID site --> curated by OWID from the cursed web + world.

It makes sense that a proxy, as thin + general as possible, be well tested and maintained. That feels like a different scope than this issue, but something that this could rely on. And the spec for such a proxy should perhaps come from those who care about the relevant security and privacy details...

owidm [or any other caching setup on wmcs] could fallback to direct API calls to OWID. [It might also want to capture historical snapshots, which could be saved as static images on commons]

@sbassett most of this discussion has focused on code, but you also mentioned the question of trusting the owid data. What would be the permitted storage for such data to be used by a nodejs service? In general, can you help us find documentation on such policies?

Is https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/example-node-api,general the example you suggested?

In T301044#8718872, @Sj wrote:

A proxy service for calls to such caches of trusted + curated knowledge makes sense.

I disagree. Suddenly we're dependent on a bunch of external resources we don't control, run by people we don't know or vet. What if a natural disaster strikes WMCS and takes it all out? What if the server goes down and all the people with access have disappeared? What if the people running these curated sources of knowledge are just evil or suborned? Just proxying it still gives away too much control, in my opinion. It's just as much an SRE/RelEng issue as a security issue, if not more so.

It seems there are at least four levels of internality: on the cursed web, on blessed partner sites, on some WM system (like wmcs; can be useful for caches as in T186293), and in production. iframe on a wiki --> thin production-service proxy --> more complex system on wmcs --> updated via APIs on a blessed OWID site --> curated by OWID from the cursed web + world.

I think there are two: production or not production. I don't see a difference between WMCS and RandomSketchyInternetSite.info. If it's not subject to WMF production controls, then it might as well be any site on the internet.

It makes sense that a proxy, as thin + general as possible, be well tested and maintained. That feels like a different scope than this issue, but something that this could rely on. And the spec for such a proxy should perhaps come from those who care about the relevant security and privacy details...

Proxies are tools. They are very useful tools, but like any tool they are not magic bullets; they still have their limits and need to be evaluated in the specific context in which they are used.

[In case it isn't obvious: I do not work for the WMF; my opinions are my own and may or may not match the opinions of the relevant WMF teams.]

Yes, and we are happy to have this run on WMF production servers by WMF staff if required, with support from us if they wish.

Bawolff: Most of our projects are built on external resources we don't control, backed up by web archives we also don't control. The threats you mention are rare events we already cope with on the projects. (And we still manage to let, e.g., Wikivoyage maps render external layers.)

Doc + Tim + sbassett: From your suggestions, does this seem like a reasonable progression?

  1. A modified OWID mirror (owidm), hosted on wmcs, that includes a first pass at sanitization and regular updates from the OWID site.
  2. A nodejs service, prototyped on wmcs, which proxies, sanitizes, and makes highly available owidm data for consumption.
  3. The service in 2. moved to Wikimedia production, with a maintenance plan.

The owid extension can be modified to read from the services above. We can confirm it works on non-production wikis and then discuss in separate issues like T303853 whether to enable the extension on production sites.

@Sj I don't think so. When we started I understood that anything to do with production has to run inside the fort, and I mistakenly thought that wmcs was inside the fort. I now understand that we are outside of the fort and are not permitted to work inside the fort.

So I think the next step is to see if people in the fort would be prepared in principle to deploy and maintain such an application. If not, there's no reason to build one.

I think we have been told that building such a service would be straightforward for an experienced developer. I can't judge that, because I still have lots of questions about what APIs such a service would use, where it is permitted to store data, what graphics libraries are permitted to render graphs, what constitutes sanitized data, and who vets it.

I am a big believer in free and open knowledge. I use it daily and have worked as a volunteer for the last 10 years pretty much full time to help make such knowledge available to those without an internet connection. I admire what OWID has done to help us visualize the big picture of major issues of our times. I think there would be a big benefit to wikis to be able to embed OWID graphs, and some communities have expressed an interest. So I don't find trying to make this happen 'bizarre', and hopefully others will embrace the idea.

Ok, I'll try to stop being a jerk and try to be more helpful :)

And to be clear, I don't think it's bizarre that you are trying to do this - it's bizarre that so many people saw you go down this particular path without engaging.

[I will add my standard disclaimer for what follows - not a WMF employee; if a WMF employee says something opposite to what I am saying, listen to them, not me. These are my personal opinions.]

There are two aspects of getting something onto Wikimedia servers, the political and the technical. Let's start with the political:

First and foremost, getting something deployed onto WMF production infrastructure is really difficult as an outsider. It is possible, people have done it, but it's hard. Most people who have done it have had friends who worked at the WMF and were well respected, to get advice about who to talk to and who to poke. You should prepare yourself to have to be 10x more convincing than an internal WMF team would have to be, and accept that you will almost certainly be subject to double standards. Sucks, but that is the lay of the land.

Ultimately, the political decision on whether to deploy some outside code comes down to two factors:

  • How much will this improve Wikimedia, make users happy, and see actual usage on Wikimedia wikis?
  • How likely is this to break things, cause problems, increase complexity, make maintenance more difficult, etc.?

You should consider this process to be like a sales pitch. You want to convince WMF folks that your thing will really make a difference. At the same time, you want to convince WMF people that your thing will not cause problems, and if it does cause problems, it will be easy to fix. Remember, in the unlikely event that whatever you are working on takes down the site, you are probably not the one who will be paged at 2 am on Saturday, but there probably does exist someone who would be. You want to sell the idea that it is extraordinarily unlikely something like that would happen, and if it did, the thing uses technologies similar to the rest of Wikimedia, so everyone already has experience dealing with it.

This has implications for your technology choices. You want to choose, as much as possible, technologies that are similar to things that already exist in Wikimedia production. Follow the same coding style that MediaWiki uses, etc. Minimize complexity and avoid having too many moving pieces. You want your thing to not stand out. People are more likely to say yes to things they don't have to think about.

The other side of the coin is that you need to sell the idea that this is valuable. You have some interest from Basque Wikipedia, but honestly that's not really enough. I'm not saying it has to change the world, but the use case should be compelling. You should have an elevator pitch on how this will make Wikipedia better. How many pages is this applicable to in the best case? If it were installed on all wikis, how much use do you think it would get? Etc. You want to make it clear that the benefits are worth the cost.

My specific recommendation would be to socialize your ideas more. You should have a project page on mediawiki.org detailing what you are planning to do: both the elevator pitch of why it is a good idea and the proposed architectural design (what are all the pieces, how do they talk to each other, etc.). Once this document is written, ask for feedback. First on IRC (primarily on #mediawiki, but also try some more specific channels like #wikimedia-releng maybe) and then on wikitech-l (and anyone else who will listen). [In principle I should include TDF here, but TDF is confusing and clouded in secrecy. At this point I'm unclear whether it still even exists. Suffice it to say I am not a fan and can't really recommend it, but do your own research on that point; YMMV.] The point of this is twofold: first, to get feedback (although just because someone gave you feedback doesn't necessarily mean you should listen; when it is an open-to-the-world website, some of the feedback you get will be stupid, but be diplomatic about it and have good reasons for the choices you make). Second, to plant the idea in the minds of important developers. By the time it comes to making a decision on whether to deploy this, you don't want people to go "wtf is this?" - you want people to be, "Oh, I remember hearing about this, glad they got it finished". It's important that people are warmed up to the idea by the time it comes across their plate. It also helps build momentum, which ideally you can use to find someone at the WMF who likes your idea and would be willing to advocate for it internally, which is extremely helpful.


Some more specific technical thoughts:

There are lots of potential implementation plans, all with different pros and cons. A separate (nodejs? backed by Cassandra? MySQL?) service is one approach. The downside to this approach is that it's a bit independent of everything else. Now, in some ways that is a really good thing, but it also has some downsides. In my view, the more technically unique something is, the less likely people will want to deploy it and the less well documented the path is. As an outsider, it is also pretty unclear to me what the current status of node.js services is at Wikimedia. Is RESTBase deprecated? Are node.js-based microservices deprecated in general? I have no idea what is considered current best practice in that space, and I don't think it is particularly well documented. It would probably be a good idea for you to seek the opinion of people who work in the more microservice-oriented areas of the WMF, which is definitely not me.

My first reaction is to ask whether all that is really needed. Is it necessary to store a full mirror of everything in a service? I wasn't able to find the source of owidm to see what it actually does, but would it make sense to just dynamically generate these graphs directly in a MediaWiki extension (with caching)? [I suppose it depends on what modifications you are doing and how expensive they are.] That cuts down on the moving pieces quite a bit: no extra things to set up, nothing unique just to your thing that needs maintenance. Technically there are probably reasonable arguments for why this is an inferior solution (it is less isolated, fetching of data is less controlled, and it is a bit more sensitive to external service issues). I would argue, though, that this might be easier in terms of long-term maintenance as well as convincing people to actually do it, since there is basically nothing additional to set up and maintain. However, I should acknowledge that this may be a controversial opinion and there are probably people who disagree with me.

@Bawolff thanks for your thoughtful reply. I would be eager to discuss this further, but I am not clear whether this is the right venue. I don't want to overstay my welcome on Phabricator.

I was emailed to help in this case and here is my explanation. I agree with basically everything Brian has said above.

Speaking on the political aspect of this: we have the major problem of a lack of clear technical design and policies, a completely broken technical decision-making process, and, in short, an unclear "path to production", which has caused similar issues even for teams inside the WMF before (Team A builds something, goes to SRE to deploy it, SRE refuses on the grounds that it'll break everything, drama ensues). It's always better to consult with someone who has gone down this path before in the design phase (very, very early) so there wouldn't be a need to make major architectural changes later on. This is what's called "shift left", and I really hope to have some decent processes for it by the end of next fiscal year (my dream is something like a "developer handbook"). Anyway, enough rambling.

On the technical side: production is extremely different from WMCS. It's not just different networks; it's different technical paradigms. When you want to deploy something to Wikimedia production, there are a lot of important points to consider. What would HA look like? Three nodes/pods? If so, per DC or not? Should it be VMs, on k8s, or bare metal? If k8s, how many pods and with what memory/CPU size? If bare metal, what would the hardware specs be and where would the budget for it come from? What is the I/O profile for networking? What are the failure scenarios? How do you handle graceful degradation? And this is just the basic SRE part; if you want to consider the security part: what services would depend on it and what services would it need? Production services can't talk to the general internet; you have to explicitly open a hole in the firewall to get this data in, which could in turn reduce security and make it an attractive point for attacks. Who is going to maintain this after deployment? I can go on for hours, but you get the idea.

Just for this specific case: adding a proxy between Wikipedia and owid through an iframe can easily lead to a massive burst on the third-party site, bringing it down, which in turn could bring us down (parsing or serving starts to fatal because it can't connect to owid). It's called a metastable failure. There are many other scalability and security issues as well; I don't think this can be deployed in production without a major overhaul and re-architecture. I'm sorry, I won't back this, but I won't block it either.
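One common mitigation for that burst/metastable-failure concern is to put a cache with a stale-serving fallback between the wikis and the graph server, so a spike in page views does not become a spike in upstream requests. A hedged Python sketch of the idea (TTL values and behaviour are arbitrary, not a recommendation):

# Sketch of a cache-with-stale-fallback, illustrating how a proxy layer can
# avoid amplifying load onto the upstream graph server. TTLs are arbitrary.
import time

class StaleCache:
    def __init__(self, fetch, ttl=3600, stale_ttl=86400):
        self.fetch = fetch          # callable that actually hits the upstream
        self.ttl = ttl              # serve from cache for this long
        self.stale_ttl = stale_ttl  # serve stale rather than erroring for this long
        self.entries = {}           # key -> (timestamp, value)

    def get(self, key):
        now = time.time()
        ts, value = self.entries.get(key, (0, None))
        if value is not None and now - ts < self.ttl:
            return value            # fresh hit: upstream is not contacted at all
        try:
            value = self.fetch(key)
            self.entries[key] = (now, value)
            return value
        except Exception:
            # Upstream down or overloaded: serve stale content if available,
            # rather than failing the wiki page render.
            if value is not None and now - ts < self.stale_ttl:
                return value
            raise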

I think this calls for a reply, leaving aside questions of consulting a policy manual that has not been written or consulting with those who have done it before but don't respond to emails.

The biggest problem is that I think there are misconceptions as to the current architecture, which lead to assumptions about problems in putting it into production.

The current architecture is quite simple.
There is an extension on a MediaWiki server and there is a source of OWID graphs, let's call it the OWID Graph Server.
When the HTML page is being assembled in MediaWiki, if there is an owid extension tag then an <iframe>... </iframe> block is generated that contains the source URL of a graph page on the OWID Graph Server. (It is constrained so that it can only contain a URL in a specific directory, only on that server, and without 'unsafe' characters.)
When that HTML is returned to the user who requested the wiki page, the browser starts rendering it, and when it hits the <iframe> it makes a request to the OWID Graph Server.
The OWID Graph Server returns HTML, JS, CSS, SVG, JSON, and images to the user's browser (note that this traffic does not pass through MediaWiki), and it gets rendered in the browser.
It only returns the graph portion of the page and not other text or images.
As a matter of security, an iframe block is supposed to be a sandbox in which CSS and JS only act on the DOM inside the block and have no access to the containing DOM.
If the user interacts with the graph or its tabs, JavaScript dynamically redraws the graph without a trip to any server.

The OWID Graph Server is also pretty simple. It has a web server and static files: no PHP, no database, no nodejs. Ideally it would run in a secure environment under WMF control, with sufficient capacity to handle the traffic from users who access pages that contain graphs, and with public access to that web server. If it went down, it would leave a hole on a page but would not affect other servers, similar to what happens now that the Graph extension has been disabled.

The OWID Graph Server is periodically refreshed with a batch job that gathers new or changed graph HTML and JSON from OWID themselves. This is like a bot and could run wherever such bots run. It downloads static HTML and other assets and sanitizes them using a combination of Python and Bash.
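A hedged sketch of what such a refresh job could look like (the slug list, paths, and sanitize step are placeholders; the real python/bash scripts live in wpmed/owid-static-mirror):

# Illustrative refresh-bot sketch: fetch grapher pages from upstream, run them
# through the sanitizer, and write them into the static mirror tree.
# URLs, paths, and the slug list are placeholders, not the actual job.
import pathlib
import urllib.request

UPSTREAM = "https://ourworldindata.org/grapher"
MIRROR_ROOT = pathlib.Path("/srv/owidm/grapher")  # assumed output directory
SLUGS = ["share-with-schizophrenia"]              # normally a full list of graphs

def sanitize_html(html: str) -> str:
    return html  # stand-in for the real privacy/licensing sanitization pass

def refresh():
    MIRROR_ROOT.mkdir(parents=True, exist_ok=True)
    for slug in SLUGS:
        with urllib.request.urlopen(f"{UPSTREAM}/{slug}", timeout=30) as resp:
            html = resp.read().decode("utf-8")
        out_file = MIRROR_ROOT / f"{slug}.html"
        out_file.write_text(sanitize_html(html), encoding="utf-8")
        # The generated pages are still manually reviewed before deployment.

if __name__ == "__main__":
    refresh()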

All of this could use improvement and tightening if a decision is made to include this functionality within wiki pages, but I suggest we start with what's there.

With the Graph extension now disabled, not to mention having been problematic for some time, this might also be an opportunity to reach out to OWID to collaborate on a data visualization infrastructure for wikis that would go beyond OWID and the Graph extension.

Agreed that this could be a fine opportunity for a proper collab with Our World in Data. The underlying snapshots of graphs could be refreshed quite infrequently, say once a quarter, and still be tremendously valuable and a great improvement over what we have now.