New Service Request: Wikidata Termbox SSR
Closed, Resolved · Public

Description

Description: https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service
Timeline: 2019-01-31 would be fantastic. 2019-02-28 would be great. We will start to get worried if we don't have the service by end of March 2019.
Diagram:

Wikidata Termbox SSR Architecture 2018-12-18.png (399×636 px, 18 KB)

Wikidata Termbox SSR Sequence-SSR.png (664×961 px, 58 KB)

Technologies: nodejs
Point person: @WMDE-leszek, @Addshore as a backup (especially deployment topics)

Source code: https://github.com/wmde/wikibase-termbox, move to gerrit to happen latest in early Jan 2019.

Load Details

The initial responsibility of this service will be the rendering of the term box for wikidata items and properties for mobile web views.
Currently wikidata.org gets no more than 80k mobile web requests per day (including cached pages and non-item/property pages).
If we assume all of these requests were to uncached item and property pages, this SSR service would be hit 55 times per minute.
In reality some of these page views are not to item or property pages, and some will be cached, so we are looking at no more than 1 call per second.
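
For reference, a quick sanity check of those figures (a sketch using only the numbers above):

```typescript
// Back-of-the-envelope check of the load estimate, using the figures above.
const requestsPerDay = 80_000;
const perMinute = requestsPerDay / ( 24 * 60 ); // ≈ 55.6, the "55 times per minute" figure
const perSecond = perMinute / 60;               // ≈ 0.93, i.e. "no more than 1 call per second"
```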

Event Timeline

In https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service we see that "There is a server-side and the client-side variant of the code, which are distributions of the same implementation." Looking at the current repository, we see that it shares data models (https://github.com/wmde/wikibase-termbox/tree/master/src/datamodel) but that there is common, client-only, and server-only data-access logic. I think architecture decisions hinge on this separation and the reasons for it.

In the simplest case, this code would be almost identical client and server-side. No matter where it's running, nodejs or the browser, it would request data, receive it, and render it as html. Mediawiki and wikibase would be responsible for compiling that data in a nice way. The flow would be:

BROWSER --get-interface--> NODEJS --get-data--> APIs --return-data--> NODEJS --return-interface--> BROWSER

As I understand it, in the proposed architecture, mediawiki sits in between the browser and node to prevent exposing another public endpoint for the node service. Is my understanding correct?

Now, I started looking through the code and it looks like there's an effort to keep server and client logic as common as possible, with factories and interfaces and nice patterns, but there are still differences. It looks like there's good reason for this, but how would the client be able to act as a fallback for the server as proposed?

By the way the code looks good and I'm glad to see this work going forward.

Krinkle subscribed.

This task proposes a significant change to software architecture and should follow the RFC process. Tagging it as such.

I've also triaged it and after reading the description and linked wikipage, I believe the following should be clarified before TechCom can be effective in gathering and processing input from relevant stakeholders. Specifically:

  • A description of the "termbox" feature, how it currently works technically in production, and what its current requirements are.
  • A brief statement of what problem this proposal would solve (e.g. additional requirements you want to solve but currently can't).

If you'd like input or feedback on anything from TechCom at any point, feel free to move it back to the "Inbox" column on the TechCom-RFC workboard. You can also use the "Request IRC meeting" column to request an office hour on IRC about this RFC.

"We should not introduce a service that is called by MediaWiki, and itself calls MediaWiki."

Slightly OT, but a +1000 YES to this. Been there, seen that antipattern; it's a mess to reason about. The coupling of the two components makes it near impossible to test/benchmark/debug the interactions. It is also a mess to untangle and fix once it is identified.

@Milimetric wrote:

In the simplest case, this code would be almost identical client and server-side. No matter where it's running, nodejs or the browser, it would request data, receive it, and render it as html.

I think you are right, that does make sense; then the rendering service should sit between the client and MediaWiki, and not be called by MediaWiki. But that means it cannot be used for serving the default rendering of the page from index.php.

If we have these two requirements:

  1. use vue.js based rendering when serving page content from index.php
  2. the mechanism for accessing Wikibase content from vue.js based code should be the same on client and server

...then I see no way to avoid the "PHP calls JS calls PHP" issue in general.

But for the case at hand, there might be a workaround: the PHP code that renders the (Wikibase Entity) page content already knows what data will be needed for the rendering. It can send it to the rendering service along with the request to render. The vue.js code would then need to have a "fake repo request" facility that would just use data that was passed in with the original request, and would fail (or at least warn) when trying to load any additional content by calling the actual MediaWiki API. I think that solution would still be fairly clean, and would perform better than calling back to the MediaWiki API all the time.
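
A minimal sketch of what such a "fake repo request" facility could look like (interface and class names here are hypothetical, not taken from the actual codebase):

```typescript
// Hypothetical sketch: a repository that only serves data passed in with the
// original render request, and fails loudly instead of calling the MediaWiki API.
interface EntityRepository {
	getEntity( id: string ): Promise<object>;
}

class PreloadedEntityRepository implements EntityRepository {
	public constructor( private readonly preloaded: Map<string, object> ) {}

	public getEntity( id: string ): Promise<object> {
		const entity = this.preloaded.get( id );
		if ( entity === undefined ) {
			// The data was not sent along with the render request; warn/fail
			// rather than calling back into the MediaWiki API.
			return Promise.reject( new Error( `Entity ${ id } was not preloaded` ) );
		}
		return Promise.resolve( entity );
	}
}
```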

Thinking further into the future, the dilemma could perhaps also be solved by separating MediaWiki's storage layout entirely from the application layer; this way PHP and JS code can use the same storage layer API for retrieving page content. But this is just a thought for the future - it would take a while, and I'm not even sure it's a good idea.

But for the case at hand, there might be a workaround: the PHP code that renders the (Wikibase Entity) page content already knows what data will be needed for the rendering. It can send it to the rendering service along with the request to render. The vue.js code would then need to have a "fake repo request" facility that would just use data that was passed in with the original request, and would fail (or at least warn) when trying to load any additional content by calling the actual MediaWiki API. I think that solution would still be fairly clean, and would perform better than calling back to the MediaWiki API all the time.

When rendering the page, index.php knows the exact data that needs to be rendered already, correct? If so, it can send that to the client, and then based on whether JS/SW is available or not, the client either renders it or sends a request to the service which sits behind Varnish. Or am I missing something.

Thinking further into the future, the dilemma could perhaps also be solved by separating MediaWiki's storage layout entirely from the application layer; this way PHP and JS code can use the same storage layer API for retrieving page content. But this is just a thought for the future - it would take a while, and I'm not even sure it's a good idea.

Somewhat OT for this task, but I would actually argue that the democratisation of the storage layer is a good thing, as it allows different entities to focus on what they are supposed to achieve. Naturally, this implies the storage solution is scalable, secure, etc.

When rendering the page, index.php knows the exact data that needs to be rendered already, correct? If so, it can send that to the client, and then based on whether JS/SW is available or not, the client either renders it or sends a request to the service which sits behind Varnish. Or am I missing something.

That works, but defeats the purpose. The idea is to present a default rendering to clients that don't have JS enabled (or no sufficiently current JS support). That rendering should be generated by the same vue.js code that does the rendering on the client.

Clients that do have JS support will ask the API for different data (based on user language settings, for logged in users) and will re-render the term box based on that. No server side rendering required.

That works, but defeats the purpose. The idea is to present a default rendering to clients that don't have JS enabled (or no sufficiently current JS support). That rendering should be generated by the same vue.js code that does the rendering on the client.

But the same principle applies, regardless of whether the req is sent by the client or not. If index.php has all the data to render it, then that means that it can send it to the service directly without the need for the service to call MW back (now, whether the client will be used as a proxy is a different issue).

Clients that do have JS support will ask the API for different data (based on user language settings, for logged in users) and will re-render the term box based on that. No server side rendering required.

This confuses me. You first need to render the page on the server before you know whether the client supports JS/SW or not, so it will need to be rendered on the server irrespective of the client's capabilities in the case where MW calls the service directly before handing out the page.

When rendering the page, index.php knows the exact data that needs to be rendered already, correct?

I just had a brief chat with @Jakob_WMDE and @Pablo-WMDE about this. For the current use case ("term box"), this would be easy, but for the anticipated generalized use case ("render entire wikibase entity") this is non-trivial: the template doesn't just need the entity data itself, but also the labels of referenced items, data types of properties, localized names of units, etc. Only the JS code really knows what it needs, and even it does not have that knowledge in a central place.

You first need to render the page on the server before you know whether the client supports JS/SW or not, so it will need to be rendered on the server irrespective of the client's capabilities in the case where MW calls the service directly before handing out the page.

Yes, that is correct; my statement was imprecise: there is no need for server-side rendering of the personalized term box; that is done on the client. Only the general version with the default languages needs to be rendered on the server. This is also the version that is cacheable, intended to be included in the output of index.php.

(Actually, this is an assumption, it may not be true. Currently, we do render the personalized term box on the server for logged in users, and merge it into the output by substituting a placeholder; I assume that the plan is to not do that, and drop support for a personalized term box for people without JS. But I did not confirm this assumption)

There sure has been a fair amount of discussion on this ticket!

So I have created an updated interaction diagram showing off a few more details of the overall flow (and updated the description).

Wikidata Termbox SSR Sequence-SSR.png (664×961 px, 58 KB)

This highlights the various levels of caching.

  • Calls for entity data will be going via Varnish, and Special:EntityData is cacheable
  • Other calls to the MW / WB API will not be Varnish-cached, but will be cached and reused by the service itself, probably with a TTL of 1 minute (see the sketch after this list)
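
A minimal sketch of such an in-service cache, assuming a simple in-memory map with a one-minute TTL (illustrative, not the service's actual implementation):

```typescript
// Illustrative in-memory TTL cache for reusing API responses inside the service.
class TtlCache<T> {
	private readonly entries = new Map<string, { value: T; expiresAt: number }>();

	public constructor( private readonly ttlMs: number = 60_000 ) {} // 1 minute

	public get( key: string ): T | undefined {
		const entry = this.entries.get( key );
		if ( entry === undefined || entry.expiresAt < Date.now() ) {
			this.entries.delete( key ); // drop expired entries lazily
			return undefined;
		}
		return entry.value;
	}

	public set( key: string, value: T ): void {
		this.entries.set( key, { value, expiresAt: Date.now() + this.ttlMs } );
	}
}
```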

Load Details (now included in the description)

The initial responsibility of this service will be the rendering of the term box for wikidata items and properties for mobile web views.
Currently wikidata.org gets no more than 80k mobile web requests per day (including cached pages and non-item/property pages).
If we assume all of these requests were to uncached item and property pages, this SSR service would be hit 55 times per minute.
In reality some of these page views are not to item or property pages, and some will be cached, so we are looking at no more than 1 call per second.

Replies to specific points, with notes from @Jakob_WMDE & @Pablo-WMDE

Oh, I got that wrong - I thought the service would be public facing and called directly from the client! But apparently it's not; it would be hidden behind api.php. So my first point, "we would not expose a new endpoint", is invalid. Instead, it should read "We should not introduce a service that is called by MediaWiki, and itself calls MediaWiki."

The initial deployment will not even have a proxy to the service via api.php, it will be entirely internal.
No part of the SSR service is public facing.

As I understand it, in the proposed architecture, mediawiki sits in between the browser and node to prevent exposing another public endpoint for the node service. Is my understanding correct?

Yes.

Now, I started looking through the code and it looks like there's an effort to keep server and client logic as common as possible, with factories and interfaces and nice patterns, but there are still differences. It looks like there's good reason for this, but how would the client be able to act as a fallback for the server as proposed?

This is an exceptional case and only works for users with JS enabled.
"Client" here would be the client-side JS, served by Wikibase.
In case there is no response from the SSR, the fallback would be CSR-only, i.e. it will only work for people with JS enabled.
This addresses the rare scenario that there is no working SSR configured (in installations run by third parties, not WMF) and still offers termbox features, but at the cost of not being accessible to users with JS disabled.

If index.php has all the data to render it, then that means that it can send it to the service directly without the need for the service to call MW back (now, whether the client will be used as a proxy is a different issue).

The "termbox" is more of an application than a template.
Only it knows which data it needs - actively "sending" data to it requires knowledge of which information is needed.
While seemingly trivial in the beginning, this will, as the application grows, become a burden in maintenance - and potentially in performance, if data that has become obsolete is sent "just to be sure".

(Actually, this is an assumption, it may not be true. Currently, we do render the personalized term box on the server for logged in users, and merge it into the output by substituting a placeholder; I assume that the plan is to not do that, and drop support for a personalized term box for people without JS. But I did not confirm this assumption)

Personalized = user language on top, then "languages most likely to be spoken" by the user. We still strive for this behavior.
Cacheability will have to be taken into account.

Now, I started looking through the code and it looks like there's an effort to keep server and client logic as common as possible, with factories and interfaces and nice patterns, but there are still differences. It looks like there's good reason for this, but how would the client be able to act as a fallback for the server as proposed?

This is an exceptional case and only works for users with JS enabled.
"Client" here would be the client-side JS, served by Wikibase.
In case there is no response from the SSR, the fallback would be CSR-only, i.e. it will only work for people with JS enabled.
This addresses the rare scenario that there is no working SSR configured (in installations run by third parties, not WMF) and still offers termbox features, but at the cost of not being accessible to users with JS disabled.

My question here was more, how can the client render everything it needs, when some of the logic, for example for data-access, is only in the src/server/data-access folder? In other words, if the client functionality is a super-set of the server, I would expect only common and client folders, and the server to use logic out of the common folder. I'm asking both for the termbox right now and plans for this service in the long-run.

My question here was more, how can the client render everything it needs, when some of the logic, for example for data-access, is only in the src/server/data-access folder? In other words, if the client functionality is a super-set of the server, I would expect only common and client folders, and the server to use logic out of the common folder. I'm asking both for the termbox right now and plans for this service in the long-run.

I hope I understand the question correctly, but the idea is that e.g. src/server/data-access and src/client/data-access both implement client/server-specific functionality of the same data-access interfaces. CSR gets all its information from the window environment (e.g. mw.config, existing Wikibase JS services, etc), whereas the SSR requests the data from the API. The client functionality is not a super-set of the server. They share most of the code, and what they cannot share is hidden behind interfaces and implemented in src/client and src/server respectively.
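
For illustration, the pattern described here could look roughly like this (interface and class names are hypothetical, not the repository's actual ones; the API call shape is only an example, and a fetch implementation is assumed to be available):

```typescript
// One shared interface, two environment-specific implementations.
interface MessagesRepository {
	get( key: string ): Promise<string>;
}

// src/client flavour: read data the browser environment already has.
class ClientMessagesRepository implements MessagesRepository {
	public constructor( private readonly mwMessages: Record<string, string> ) {} // e.g. filled from mw.config

	public get( key: string ): Promise<string> {
		return Promise.resolve( this.mwMessages[ key ] );
	}
}

// src/server flavour: fetch the same data from the MediaWiki API.
class ServerMessagesRepository implements MessagesRepository {
	public constructor( private readonly apiUrl: string ) {}

	public get( key: string ): Promise<string> {
		return fetch( `${ this.apiUrl }?action=query&meta=allmessages&ammessages=${ key }&format=json` )
			.then( ( response ) => response.json() )
			.then( ( data ) => data.query.allmessages[ 0 ][ '*' ] );
	}
}
```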

Thanks @Jakob_WMDE, I think we're saying the same thing in slightly different terms, and it's because I'm not being precise. It's ok for the client and server to have different implementations, but you're saying they have the same capabilities, right? I was thinking that for the client to be able to render everything the server does, plus handle interactivity and other features in the future, its capabilities would have to be a superset of the server's capabilities, right? So there's no HTML that could only be rendered by the server, right? It seems that way; components and interfaces are all shared.

If so, great. If not, it feels important to understand what happens when the server's not there and the client doesn't have some of the server-specific functionality.

So then for me the main question remains around the request flow that everyone else is discussing. From @Addshore's response it sounds like you considered this option and decided against it:

  • the request to index.php is conditionally routed directly to the SSR service. In our world, the SSR service is there, so we configure it in Varnish, it returns html, and Vue takes over client-side. For other mediawiki installations, index.php knows to render a basic version of the html which pulls in the Vue.js modules. Once this loads in the browser, it renders the interface.

If this was rejected, I'm just curious, what were the expected problems? One potential optimization with that approach could be, if SSR is enabled, you send a smaller client-side module that doesn't include data-access which would only ever happen once, on page-load. So you could put that in a client-fallback folder or something, include it in the version served by index.php but exclude it from the version served by SSR.

(by the way, the build is nicely separated for client vs. server here https://github.com/wmde/wikibase-termbox/blob/faf3dcca602da9f2287e8be9acaa1023144b23d7/vue.config.js#L14, where webpack analyzes the code and includes only what's required by the particular environment it's building against)
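
As a sketch of that idea (this is not the linked file's actual contents; the entry names and environment switch are assumptions), a per-target build configuration might look like:

```typescript
// Hypothetical per-target build switch: each bundle gets only the entry point
// for its environment, so webpack's dependency analysis pulls in src/server/*
// or src/client/* but never both.
const ssr = process.env.SSR === '1';

export default {
	configureWebpack: {
		entry: ssr ? './src/server-entry.ts' : './src/client-entry.ts',
		// 'node' emits a bundle for the SSR service, 'web' one for browsers.
		target: ssr ? 'node' : 'web',
	},
};
```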

the request to index.php is conditionally routed directly to the SSR service. In our world, the SSR service is there, so we configure it in Varnish, it returns html, and Vue takes over client-side. For other mediawiki installations, index.php knows to render a basic version of the html which pulls in the Vue.js modules. Once this loads in the browser, it renders the interface.

@Milimetric I am not sure if I get the suggestion right, but it sounds like almost what we are proposing, with the only difference that it would include some server-side rendering of the Wikidata parts of the page in PHP ("rendering a basic version of HTML").
We've decided not to do this because one of our goals is to have a single implementation of the rendering logic. That's why the rendering code is in JS/node. Having a second implementation of the rendering in PHP, including rendering vue templates in PHP etc, is a solution we didn't want to lock ourselves into, as it is basically asking for trouble (forgetting to update a second place when updating the other one is bound to happen at some point, for example).
Or maybe you meant that, without the SSR in place, the server will only render some kind of placeholder, and once vue js is loaded on the client, the UI will be rendered. This would mean a less pleasant user experience (the UI "appearing" later), but would still provide the full experience to the user, as long as they have JS enabled in their browsers. If this is your suggestion, I am happy to confirm that this is the approach we're planning here.

I believe answers from WMDE staff above have already touched on this topic, but to mention it explicitly, let me try to answer:

"We should not introduce a service that is called by MediaWiki, and itself calls MediaWiki."

The intention of introducing the service is not to have a service that calls MediaWiki. As discussed above, the service needs to ask for some data, and this data shall be provided by some API. Currently, the only API that can provide data about Wikidata items etc. is indeed wrapped in MediaWiki. Should there be another, more lightweight API providing access to the same data, the service would most likely be using it.
The implementation, as can be seen by looking at the linked source code, is not bound to the APIs it talks to being MW ones at all.
If, between the lines, people here are suggesting having some lighter API to access Wikidata data, I can only second those ideas. Introducing such API(s) seemed to be too much of an endeavour to do together with introducing the new front-end solution. We preferred to take one step at a time.

Let me state it again: the SSR service should not need to call the mediawiki api. It should accept all the information needed to render the termbox in the call from mediawiki.

So we should have something like:

  • MediaWiki makes a POST request to the SSR, sending the entity data, the language information *and* all the messages
  • The service transforms said data into HTML and sends it back to MediaWiki (a sketch of this flow follows below)

This has several advantages over the proposed solution:

  • Only one RPC call, instead of several (at least 3 AIUI)
  • No need for costly change propagation to get cache invalidation for the service
  • No need for caching in the service, even
  • Performance is going to be much better given we don't have to instantiate the mediawiki request context several times
  • The design is more resilient as the service will not depend on any backend in order to work

I would consider switching to this model a necessary condition for a production deployment.

@Addshore @WMDE-leszek do you see any reason why this wouldn't work?
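
For concreteness, a minimal sketch of the push model described above; the endpoint path and payload fields are assumptions for illustration, not an agreed API:

```typescript
// Hypothetical single-call flow: MediaWiki POSTs everything needed to render.
import express from 'express';

const app = express();
app.use( express.json( { limit: '1mb' } ) );

app.post( '/termbox', ( req, res ) => {
	const { entity, languages, messages } = req.body;
	// Pure transformation: data in, HTML out. No call back to MediaWiki,
	// hence no change propagation or caching needed inside the service.
	res.type( 'html' ).send( renderTermbox( entity, languages, messages ) );
} );

// Stands in for the actual Vue-based rendering code.
declare function renderTermbox( entity: object, languages: string[], messages: Record<string, string> ): string;
```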

The "termbox" is more of an application than a template.
Only it knows which data it needs - actively "sending" data to it requires knowledge of which information is needed.
While seemingly trivial in the beginning, this will, as the application grows, become a burden in maintenance - and potentially in performance, if data that has become obsolete is sent "just to be sure".

An application is something that provides a public-facing, standalone functionality. This doesn't seem to be the case.

The only difference here would be moving some business logic into the MediaWiki extension instead of keeping it inside the service.

The maintenance burden is there if we keep building circular dependency loops between services; it will just be moved onto the people running the service in production instead of onto deployment coordination.

Also, if we're going to build microservices, I'd like to not see applications that "grow", at least in terms of what they can do. A microservice should do one thing and do it well. In this case, it's using data from mediawiki to render an HTML fragment; unless you want to make it do something different, the thing that might change is what data it needs to use.

The intention of introducing the service is not to have a service that calls MediaWiki. As discussed above, the service needs to ask for some data, and this data shall be provided by some API. Currently, the only API that can provide data about Wikidata items etc. is indeed wrapped in MediaWiki. Should there be another, more lightweight API providing access to the same data, the service would most likely be using it.

Apart from the fact that the MediaWiki API is very performant, I don't think that "the data shall be provided by some API". The data might as well be provided by the caller. I can assure you it's absolutely impossible to get better performance than being fed the data by your caller, versus having to retrieve it with several RPC calls, no matter how lightweight (which I guess means "performant") the API you call might be.

To avoid misunderstandings: I was not questioning MediaWiki's action API being performant. By "lightweight" I was referring to the "PHP has high startup time" point @daniel made above as one of the reasons why no service should call the MW API.

To avoid misunderstandings: I was not questioning MediaWiki's action API being performant. By "lightweight" I was referring to the "PHP has high startup time" point @daniel made above as one of the reasons why no service should call the MW API.

I read @daniel's comment as "why do that costly operation several times instead of doing it once", which makes sense to me. Also sorry if I came across a bit strong, but there has been a lot of FUD spread on the topic of MW API performance, and I wanted to dispel that myth :)

@WMDE-leszek ok, we're on the same page, except the crazy part of my proposal. I was saying directly routed to SSR service as in, without ever hitting mediawiki and spinning up the mediawiki context. So this would expose SSR publicly. The fallback stub HTML generated by mediawiki would work as you understood. Was this ruled out, routing directly to SSR?

@Addshore can you go more in depth about why the SSR service is the only one that knows what data it needs? Would it be possible to factor out the code that compiles that data and implement it anywhere else, or is it for some reason tightly coupled with the interface rendering?

@Joe said:

the SSR service should not need to call the mediawiki api. It should accept all the information needed to render the termbox in the call from mediawiki.

After talking to the Wikidata folks, I realized that this is not easy at all. It requires the calling code to know what data is needed, and that depends on implementation details buried in several places of the rendering code, possibly different for entity types defined by different extensions, etc. For the term box alone, "send the entity" will work. For rendering more (e.g. all the statements), this will not work.

@Milimetric said:

I was saying directly routed to SSR service as in, without ever hitting mediawiki and spinning up the mediawiki context.

That doesn't work, the goal is to deliver a full page to the client, with all the skin chrome. This is for index.php serving a full page.

So, overall, it seems like the solution proposed by the Wikidata team is the only one viable at this time. I'm not very happy about this, but I don't really see an alternative. Doing It Right (tm) would require MediaWiki to have either the presentation layer or the storage layer split off. That'll have to wait a couple of years.

The one viable improvement that was raised would require ESI (or a similar mechanism): the call to index.php would return HTML that contains a placeholder for the termbox, which would be resolved at the edge by a callback to the SSR service, which could then call back to MediaWiki as needed, without a circular dependency. But IIRC, @Joe does not like ESI either...

So, overall, it seems like the solution proposed by the Wikidata team is the only one viable at this time. I'm not very happy about this, but I don't really see an alternative. [..]

I'll look into this RFC in more detail at a later time, but at a glance, this does not seem fair.

As currently presented, it appears this RFC is lacking a problem statement. It isn't solving a product need, user need, or technological need. Rather, it starts out on the assumption that we're going to have UI code in production (based on Vue.js) written in a way that contains too much business logic in its templating code.

If we're talking about a new approach for Wikidata front-end, I think it makes sense to generalise the acceptance criteria to the larger problem being solved. If we keep the above (seemingly artificial) restriction in place then, in my opinion, there is no room for an RFC conversation to take place.

@Krinkle said:

Rather, it starts out on the assumption that we're going to have UI code in production (based on Vue.js) written in a way that contains too much business logic in its templating code.

That is correct: the Wikidata team made the determination that they want to use vue.js for the frontend. The idea is to use a standard framework that allows rich interactions, and to use the same code base for user interaction on the client and for rendering the initial static view. This is opposed to the current situation, where the static rendering is done in PHP and interaction is implemented in JS, with static templates shared between both mechanisms. This has proven extremely cumbersome and inflexible; it was already a problem when I was still the tech lead of the Wikidata team. Using vue.js with some sort of server-side execution mechanism has been proposed and discussed by the Wikidata team for about two or three years, at hackathons and summits. We have also discussed it at TechCom, though not as an RFC I think. In essence, they were always told to go ahead. Going back on that now doesn't seem right to me, and I don't see a viable alternative (other than React perhaps, which would pose the exact same problem).

I agree that it would be good to have the motivation for using vue.js documented, along with the alternatives considered and trade-offs evaluated. But this does not come out of the blue. This has been in the pipeline for years, and every effort was made to communicate with various WMF teams about this effort.

I agree that we can't go back on decisions that are 3 years in the making. But I do like Timo's point that we should state the problem. Here's an attempt:

"Implementing interaction in JS on top of static html rendered by PHP is inflexible and complicated. It leads to code that is hard to maintain, test, or improve. This is because two different code bases which require developers with different skill sets have to coordinate any changes."

Feel free to adapt that to better describe the problem ^. @Krinkle, if something like that was made part of the RfC, would that be a good step?

herron triaged this task as Medium priority. Jan 2 2019, 9:06 PM

Thanks everyone for the comments so far. This ticket in its current state is definitely not a ready RFC, you're right. We're going to turn it into one/create a separate RFC ticket in the upcoming days.
As preparation, we're going to have a little chat with @Joe on Monday to talk about our plans and see which elements of our plan are particularly unclear or problematic. Things are clear in our heads, but this does not mean it is all obvious to other people :) Interested CET-timezone people are welcome to join of course.
This talk is of course not meant as a replacement for the RFC review process.

Did the chat with @Joe happen? What was the outcome?

It did (today, not on Monday though). I hope the outcome is that @Joe and @akosiaris have a better understanding of what we have in mind. What we talked about (Wikibase front-end architecture) is also going to be turned into an RFC in the next 24 hours.

WMDE-leszek changed the task status from Open to Stalled. Jan 9 2019, 6:17 PM
WMDE-leszek removed a project: TechCom-RFC.

I've submitted an RFC about the whole concept of the Wikibase front-end changes as T213318. I've taken the liberty of subscribing all people who were kind enough to comment on this task to the RFC.
This ticket was intended as the "pure" service request, hence removing the TechCom-RFC tag. Also marking it as stalled for now, to focus on the RFC ticket first, as the service request has little point without our general approach being discussed first.

I've tried to read all of it and maybe I've missed something, but I am still not sure what added value having such a separate service gives us. We are creating a pretty complex pattern of interaction between MediaWiki and an outside service, and I am not sure why this service is better than just having code inside MediaWiki do the same. Is it caching? But we can do caching inside MediaWiki. Is it the ability to partially render HTML? But we can have a partial-render API inside MediaWiki (and as I understand it, that's how it is going to be called by the front-end UI anyway?). Is it supplying data for an SPA written in JavaScript that does the rendering on the client? But the MediaWiki API is completely capable of returning JSON as well as returning HTML. I am not sure I really understand - what is the added value of the external service here?

If the idea is to have a stateless (QID, language) -> HTML renderer, with the expectation that the HTML would be highly cacheable - is it true that it is? And if so, is it true that the best way to cache it is creating an external service?

Please excuse me if I missed some important part - there's a lot of text to read :) If there's an answer for this already, please feel free to point me to it.

@Smalyshev No reason to be sorry for asking the right questions!
If we truly wanted* to boil the reason down to one sentence: The reason this is a dedicated service is the language it is written in (typescript), which was chosen because it allows us to create an implementation which can be compiled/transpiled to work on both server and client - something not possible with PHP.
Please see T213318 for (hopefully) more pointers in that regard (ctrl+f "avoid redundant implementations").

(* with the known consequences in complex discussions)

The reason this is a dedicated service is the language it is written in (typescript), which was chosen because it allows us to create an implementation which can be compiled/transpiled to work on both server and client - something not possible with PHP.

For the concrete use case of the Wikidata term box: The term box is an interactive element UI. But the initial rendering should happen on the server (to avoid jumping and delays, and also to provide a view for clients with no JS). If both renderings should use the same code, we need to somehow run the JS code on the server.
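
As an illustration of "running the JS code on the server", here is a minimal sketch using Vue 2's vue-server-renderer (which the 2019-era stack corresponds to); the component and its data are invented for the example:

```typescript
import Vue from 'vue';
import { createRenderer } from 'vue-server-renderer';

// The same component code that would hydrate in the browser...
const label = 'Douglas Adams'; // e.g. Q42's English label, as example data
const app = new Vue( {
	render: ( h ) => h( 'div', { class: 'termbox' }, label ),
} );

// ...rendered to an HTML string on the server, ready to be embedded
// into the page MediaWiki serves from index.php.
createRenderer().renderToString( app ).then( ( html ) => {
	console.log( html ); // <div class="termbox" data-server-rendered="true">Douglas Adams</div>
} );
```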

I get the idea of server-side HTML rendering to avoid delays. But I am kinda questioning whether the advantage is worth splitting the code out of PHP into a separate service, with all the complexities that follow from that.

Maybe there's an easier way to achieve the same benefit - e.g. using some kind of template engine that exists in both JS and PHP, so we can be reasonably sure that if the template is the same, the output is the same? We're not talking Wikitext-class complexity here; we'd be controlling the templates, and the box itself seems to be pretty basic HTML. So I wonder whether it's worth such complications just so it can always be run in JavaScript...

@Smalyshev I believe the approach we are suggesting really makes a difference when thinking beyond just rendering a template for a particular part of an Item page. I can definitely be blamed for not making it clear enough in the description of this task that we are NOT intending to build a template renderer, but a UI application which determines what data to load (from which APIs), what actions (APIs) to call etc., and also what to render. Also, it should be noted that our ultimate goal is not just to change the way this particular element of the item page (i.e. the "termbox") is handled in the Wikibase front-end code; it is about the Wikibase front end as a whole, with the termbox being only the first step.

The overall approach/concept is more precisely (I hope) described in the RFC ticket T213318, which might also be a better place to raise questions like yours.

Thanks, T213318 makes it a bit clearer, though not entirely clear which parts stay in PHP and which parts move to JS. Would it also be true that there is no way to render Wikibase content (even without editing) on non-JavaScript browsers? The SPA approach in T213318 suggests that any Wikibase interaction - including merely displaying http://www.wikidata.org/wiki/Q42 - requires the SPA being started? Or do we keep maintaining a PHP renderer parallel to the SPA/JS renderer?

@Smalyshev My understanding (which may be dated or incomplete) is this: there would be no PHP rendering; JS rendering would need to happen either in the browser or on the server. If the server supports SSR (e.g. for Wikimedia projects), you can read with a non-JS browser. If the server does not support SSR (e.g. on shared hosting), then you need a JS-enabled browser to read.

We do not intend to maintain the "proper" UI logic in PHP. The SSR service will render the page on the server side, and that rendering will (via MediaWiki etc.) make it to the reader's browser. We do want Wikidata readers and editors to have access to the data even if they decide to disable JS in their browser, hence the request for this service.

Addition: we will likely end up doing some "rudimentary" rendering in PHP in case there is no node SSR service, or it is unavailable, to avoid having a completely blank page before the JS loads. But as we are going to migrate gradually, this is not going to be an issue from the very start, as we only "replace" part of the item/property page in the first step.

WMDE-leszek changed the task status from Stalled to Open. Feb 25 2019, 7:21 PM

Unstalling as the RFC requested above is now approved! (see: T213318)

How are we going to proceed with this service request?
For what it is worth, we've requested the security review of the code that will be running on the service: T216419

Per some IRC discussions we had in #wikimedia-serviceops, the code should be updated to be service-runner compatible, as this will greatly increase homogeneity and allow for easy handling of things like logging and metrics, as well as potentially rate limiting and DNS cache management. As far as I have understood, @Tarrow is already working on that (many thanks!). Following that, we should enable the pipeline for the project so that it builds docker images for this service. The first part is easy; we will need just a .pipeline/blubber.yaml and enabling the pipeline. Adding @thcipriani for that. Docs are currently under https://wikitech.wikimedia.org/wiki/Blubber. I can help with the next step, which is the creation of a helm chart for the service. After that (and assuming all other prereqs are done), it's time for deployment.

There are a number of questions to answer as well, regardless of all the technical questions above:

  • We will need contact details in case the service suffers an outage
  • We will need a person/team to be the service-owner (that can be the same as above)
  • The service owner will have to state what the required availability of this service will be (no, it can't be 100%).
    • In order to answer that question in a structured way, another question needs to be answered, and it's "What will be the SLO(s) of this service" (SLO stands for Service Level Objective). Which in turn implies another question (I promise it's the last in this stack), which is "What are the SLIs for this service" (SLI stands for Service Level Indicator, aka a metric). Assuming a service-runner integration, we will easily be able to have metrics (and graphs) for requests/sec, latency, and errors. Any of these (or all of them + whatever else is deemed important to measure) can be chosen as SLIs, and a target (aka an SLO) can be chosen on those. For a better explanation of the terms SLI and SLO, for now please have a look at https://landing.google.com/sre/sre-book/chapters/service-level-objectives/, as we are still building the documentation for all of this.
  • An estimation of the traffic the service is expected to receive: Already given, it's ~1 req/s
  • A schedule for when we would like to have this deployed to production as SRE will have to reserve some cycles for this.

I am indeed already working on it.

Just so you know the current state: we are already using blubber for CI, i.e. we have 'service-pipeline-test' run in zuul/layout.yaml. I suppose this will soon need to include '-and-publish'?

Our blubber will obviously need tweaking once we have the service-runner integration working though.

I am indeed already working on it.

Just so you know the current state: we are already using blubber for CI, i.e. we have 'service-pipeline-test' run in zuul/layout.yaml. I suppose this will soon need to include '-and-publish'?

Yes, that's correct.

Our blubber will obviously need tweaking once we have the service-runner integration working though.

@akosiaris thanks for listing the information needed by SRE. This is very helpful.
Before I add those to the task description, we'd appreciate you having a look at these (especially the SLOs) and advising in case we are somehow off. In particular, given the service is going to be accessed via MediaWiki, its availability depends on the availability of MW. As we don't know what the uptime of Wikipedia is, please tell us if we're going overboard here.

Contact details in case the service suffers an outage: Wikimedia Deutschland, leszek_wmde at freenode, @WMDE-leszek on phabricator

Person/team to be the service-owner: Wikimedia Deutschland

SLOs of this service

  • 500 ms < request latency < 1500 ms
  • error rate (1/n) < 1/1000
  • system throughput (1/second) < 10
  • availability (% of time) > 99.9

SLIs for this service:

  • request latency (milliseconds)
  • error rate (1/n)
  • system throughput (1/second)
  • availability (% of time)

An estimation of the traffic the service is expected to receive: ~1 req/s

A schedule for when we would like to have this deployed to production:

  • as soon as feasible, preferably 2019-03-31 at the latest.
  • Note: there is a security review of the service code pending (https://phabricator.wikimedia.org/T216419), being performed by @sbassett, who could possibly inform on status if needed. We don't know whether the service can be deployed but kept inactive/unused until the security review is done.

Thanks for this, it's appreciated. Note that we still haven't decided over which time window the availability will be calculated, but it's probably gonna be quarterly (3 months, that is).
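
For a sense of scale (simple arithmetic, not an official figure), a 99.9% availability target over a quarterly window allows roughly:

```typescript
// Back-of-the-envelope error budget for 99.9% availability per quarter.
const hoursPerQuarter = 90 * 24;                          // ≈ 2160 h in ~90 days
const allowedDowntimeH = hoursPerQuarter * ( 1 - 0.999 ); // ≈ 2.16 h of downtime per quarter
```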

I have to say I am wondering a bit about the latency as the low end seems to be quite high (500ms?). It's your service of course and I am fine with it.

As far as "wikipedia" availability goes, that's a very difficult number to come up with (it's been asked for in the past), as there are many components that constitute "wikipedia". Anyway, 99.9% looks fine to me at this point, and it is not set in stone anyway; we can reevaluate in a few quarters if it ends up being unfeasible.

  • As soon as feasible, preferably 2019-03-31 at the latest.

Unfortunately that is not feasible. This is the last month of the quarter and there are goals already running, plus we are short-handed. But I guess we can target early next quarter.

I have to say I am wondering a bit about the latency as the low end seems to be quite high (500ms?). It's your service of course and I am fine with it.

There is a chance we might want to revise this down in the future but right now it seems that being this high would not be unreasonable for us. It's difficult to gauge what is realistic right now.

  • As soon as feasible, preferably 2019-03-31 at the latest.

Unfortunately that is not feasible. This is the last month of the quarter and there are goals already running, plus we are short-handed. But I guess we can target early next quarter.

What more work/steps are needed? Is it:

  • Helm Chart
  • Security Review

Is there a "similar to production" test environment we could use to check that everything is correctly set up on our end?

If the main workload on your end would be post-deploy troubleshooting/monitoring etc., could we consider a sooner deploy of the service without any expectations of service level (and not send any wikidata.org traffic there)? Just to increase our confidence that we haven't overlooked something.

I have to say I am wondering a bit about the latency as the low end seems to be quite high (500ms?). It's your service of course and I am fine with it.

This is indeed a relatively high number. We have come up with this estimate taking into account that our service depends on others (the MW API for the time being) to fulfil its job, so we have factored in this uncertainty, hence the higher figure. We of course don't mind if the service works faster.
And FWIW, we understand "request latency" here as the time from the client making a request to getting the response from the service, not the time between the client making a request and the service becoming aware of it (just to be clear).

As far as "wikipedia" availability goes, that's a very difficult number to come up with (it's been asked for in the past), as there are many components that constitute "wikipedia". Anyway, 99.9% looks fine to me at this point, and it is not set in stone anyway; we can reevaluate in a few quarters if it ends up being unfeasible.

Understood, thanks!

  • As soon as feasible, preferably 2019-03-31 at the latest.

Unfortunately that is not feasible. This is the last month of the quarter and there are goals already running, plus we are short-handed. But I guess we can target early next quarter.

This is understandable. We had to try, though :) We're looking forward to this hopefully happening next quarter.

I have to say I am wondering a bit about the latency as the low end seems to be quite high (500ms?). It's your service of course and I am fine with it.

There is a chance we might want to revise this down in the future but right now it seems that being this high would not be unreasonable for us. It's difficult to gauge what is realistic right now.

Sure. As I said, fine by me.

  • As soon as feasible, preferably 2019-03-31 at the latest.

Unfortunately that is not feasible. This is the last month of the quarter and there are goals already running, plus we are short-handed. But I guess we can target early next quarter.

What more work/steps are needed? Is it:

  • Helm Chart
  • Security Review

There is also the LVS and DNS work, but that's in SRE's realm. It does take time, but I don't think there's anything you can do to expedite it.

Is there a "similar to production" test environment we could use to check that everything is correctly set up on our end?

There is a staging environment that might fit part of the "similar to production" definition. Things that are deployed in kubernetes are required to go through that environment anyway, so it's part of the process.

If the main workload on your end would be post-deploy troubleshooting/monitoring etc., could we consider a sooner deploy of the service without any expectations of service level (and not send any wikidata.org traffic there)? Just to increase our confidence that we haven't overlooked something.

Unfortunately the main workload is pre-deploy. There is of course work post-deploy, but I did not take that into consideration when I answered.

I have to say I am wondering a bit about the latency as the low end seems to be quite high (500ms?). It's your service of course and I am fine with it.

This is indeed a relatively high number. We have come up with this estimate taking into account that our service depends on others (the MW API for the time being) to fulfil its job, so we have factored in this uncertainty, hence the higher figure. We of course don't mind if the service works faster.
And FWIW, we understand "request latency" here as the time from the client making a request to getting the response from the service, not the time between the client making a request and the service becoming aware of it (just to be clear).

Thanks for the clarification. Just noting that, assuming correct service-runner integration, we will have metrics for the former anyway; the latter would have been more difficult.

As far as "wikipedia" availability goes, that's a very difficult number to come up with (it's been asked for in the past), as there are many components that constitute "wikipedia". Anyway, 99.9% looks fine to me at this point, and it is not set in stone anyway; we can reevaluate in a few quarters if it ends up being unfeasible.

Understood, thanks!

  • As soon as feasible, preferably 2019-03-31 at the latest.

Unfortunately that is not feasible. This is the last month of the quarter and there are goals already running, plus we are short-handed. But I guess we can target early next quarter.

This is understandable. We had to try, though :) We're looking forward to this hopefully happening next quarter.

Thanks for the understanding. We are drafting next quarter's goals this week; I'll make sure to add this.

Final question, and just for verification: this ain't going to be exposed directly to the internet after all, right? Rather, it will be called by MediaWiki, per the architectural diagrams attached to this task.

Final question, and just for verification: this ain't going to be exposed directly to the internet after all, right? Rather, it will be called by MediaWiki, per the architectural diagrams attached to this task.

That's correct!

Thanks for the understanding. We are drafting next quarter's goals this week; I'll make sure to add this.

Just poking to double-check that this was added (I would hate to see it missed).

Thanks for the understanding. We are drafting next quarter's goals this week; I'll make sure to add this.

Just poking to double-check that this was added (I would hate to see it missed).

Yup. It's in already

https://www.mediawiki.org/w/index.php?title=Wikimedia_Technology/Annual_Plans/FY2019/TEC3:_Deployment_Pipeline/Goals#Q4_Goals

Draft, so not final wording, but it wasn't forgotten. Thanks for double-checking!

Hey @akosiaris, not sure I see it in there, maybe I'm a bit lost... can you point me to where the SSR is in https://www.mediawiki.org/w/index.php?title=Wikimedia_Technology/Annual_Plans/FY2019/TEC3:_Deployment_Pipeline/Goals#Q4_Goals ?

Indeed, there was some copy/paste gone wrong involved. I amended the wording of the goal to reflect what the plan is.

Disclaimer: I do understand that SRE and others have been pretty busy in recent weeks, and I would absolutely take "we cannot really say this week" as an answer.

As it is Q4/Q2 now, I wondered, @akosiaris, whether we'd be able to narrow down a timeline for getting the service deployed. Having a rough estimate would help with resourcing here at WMDE, and would also be helpful for the Security folks in the context of T216419.

@WMDE-leszek Hi, sorry for not answering sooner; the last few weeks have been crazy indeed.

Q4/Q2 has started, and we can finally begin work on this. The tracking task for this is T220402. Barring various issues that might creep up and stall this, we could get this deployed by the end of April, hopefully even before that.

I'll start posting updates on T220402.

@WMDE-leszek, @Tarrow.

I've noticed we are missing one thing. We have a dashboard for the service's metrics at https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox, but it looks like the service isn't sending request metrics to the local statsd instance. It is, however, sending memory and nodejs GC metrics, which already appear in the graphs. service-runner already has code for this; see https://github.com/wikimedia/service-template-node/blob/a92cccea9df8af7bda315b4eb41495c95bbfbdad/lib/util.js#L98 for how to wrap the /termbox endpoint (or any other endpoint; wrapping /_info is also helpful) in order to have traffic, error, and latency graphs (and consequently SLIs) for it.
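
Generically, the kind of per-endpoint wrapping meant here boils down to timing middleware that reports traffic, errors, and latency to statsd. A hedged sketch follows (this is not service-runner's actual helper, and the metrics client interface is assumed):

```typescript
import express from 'express';

// Assumed statsd-style client interface.
interface Metrics {
	increment( key: string ): void;
	timing( key: string, ms: number ): void;
}

// Wrap every request: count it per endpoint/status and record its latency.
function requestMetrics( metrics: Metrics ): express.RequestHandler {
	return ( req, res, next ) => {
		const start = Date.now();
		res.on( 'finish', () => {
			const key = `${ req.path.replace( /\//g, '_' ) }.${ res.statusCode }`;
			metrics.increment( key );                                 // traffic and error counts
			metrics.timing( `${ key }.latency`, Date.now() - start ); // latency
		} );
		next();
	};
}
```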

@akosiaris Yep; we've interpreted it as something we really need before exposing it to real traffic. We've got a ticket open about it that we'll be picking up real soon: T226625

We're assuming that it's still ok to continue in parallel with some integration with test.wikidata.org though

@akosiaris Yep; we've interpreted it as something we really need before exposing it to real traffic. We've got a ticket open about it that we'll be picking up real soon: T226625

Cool, that is great. Thanks for the input.

We're assuming that it's still ok to continue in parallel with some integration with test.wikidata.org though

As far as I am concerned, that's totally fine.

@akosiaris Happy to say that we now have code on master sending out request metrics.

In our investigation about going to test.wikidata.org it became apparent there was a little more infrastructure work to do, so I made T226814, and I think also the required patches.

Would you be able to review them? Also note that I haven't added any secrets (because I don't know how to), just the dummy references to them.

akosiaris claimed this task.

The service has long been deployed and even has nice dashboards in Grafana; resolving.