Question: Do clients prefer to have references as structured data in order to build more useful UIs (rather than just showing a bunch of references at the bottom of the page)?
Closed, Resolved (Public)

Description

If we split out the References HTML from the rest of the page content, we can actually deliver it as JSON instead.

Doing this would mean that clients could take the structured data and display references in whatever way best suits each client, and potentially do more interesting things with them.

Current references usage in web/apps:

Lists

We show a list of references in specific sections at the bottom of the page for both web and apps. This is inline with the article. Is this the best UI? Would we like to show this list differently? For example, the apps use a full screen native table for displaying the list of revisions. Would we like to do something similar here?

Popup

We show a specific reference when a user clicks on a reference. In this case we show a special UI (a native component in the apps, a popup on the web). This seems like a case where structured data is a clear win. Currently, the apps are parsing the references out of the DOM (and will be using a JS callback in the future) to get the reference so it can be displayed in the native component. It seems the web is doing something similar.

So some questions:

  1. Do clients actually want structured data in lieu of HTML so they can construct these interfaces?
  2. Will clients still need an HTML version in addition to a structured version, OR will they be able to easily construct a UI from the structured data?
  3. Are there other uses for a references API beyond what is listed above?

Event Timeline

Reference sections can have other wikitext that would get lost in structured data. I did consider this a while back but it doesn't help MobileFrontend which needs to know how to render the HTML of the reference sections. We can of course ship the structured data AND the HTML but that's duplicating content.

e.g. consider

= References =
Some references go here
<references group="foo" />
More text
<references group="bar" />

Reference sections can have other wikitext that would get lost in structured data.

@Jdlrobson it seems like you are saying that grouping would be lost from your example? I think that would be pretty trivial to support in a structured way (nested arrays or dictionaries). Or is there something else I am missing from your example?

Not the grouping - any content in between. When implementing lazy loading references on mobile web we came across various pages where the reference sections contain content other than the list of references. Stripping out the lists makes those sections confusing, removing the section altogether loses editor content.

Consider this example for instance:
https://en.m.wikipedia.beta.wmflabs.org/wiki/Corey_references

In the case of lazy loading references, we'd collapse the notes and references sections by default. When the headings are clicked it loads the entire content of that section.

Of course this is just one use case, but I'm saying right now we need some way to obtain the whole HTML of that section.

Fjalapeno added a subscriber: Nirzar.

@Nirzar I should ask you here as well… as this is sort of a design question:

Do you prefer to show references as they are now at the bottom of a page, or would you prefer to render them differently on each platform?

Reference sections can have other wikitext that would get lost in structured data. I did consider this a while back but it doesn't help MobileFrontend which needs to know how to render the HTML of the reference sections. We can of course ship the structured data AND the HTML but that's duplicating content.

I'm curious how many pages are marked up this way. Would it be worth finding out? Did we ever find out when we were implementing lazily loaded references on the mobile site?

@phuedx @Jdlrobson FYI: I spoke to @Nirzar about this to get some design product insights.

On structure or not:

Indeed design does want to display references in a way that is more dynamic / useful. And this is best facilitated by delivering this data as structured data.
The references themselves will still contain the HTML markup, but returned as nested lists.

On reliably parsing the structure without losing context/ editor intentions:

@bearND is currently researching the problem that @Jdlrobson has illustrated above.
Essentially we need to capture a few things in the structure:

  1. The reference text itself (as HTML, along with back links and some other elements)
  2. The title of the reference section (Maybe HTML? Maybe the level and a link)
  3. Any "free form text" - which should be able to be captured as just another element in the array

Basically we should be able to deal with editor content and not lose any information by just inserting it into the structure itself.
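
To make the three items above concrete, here is a minimal TypeScript sketch of what such a structure could look like. Every type and field name here is hypothetical (no spec has been written yet), and the sample data is just the `= References =` wikitext example from earlier in this thread with empty lists.

```typescript
// Hypothetical shape only; none of these names come from an agreed spec.
type ReferenceItem = {
  kind: "reference";
  id: string; // anchor id used by back links
  html: string; // the reference text itself, still HTML
  backLinks: string[]; // hrefs pointing back into the article
};

type ReferenceList = {
  kind: "list";
  group: string | null; // e.g. "foo" for <references group="foo" />
  items: ReferenceItem[];
};

type FreeText = {
  kind: "text";
  html: string; // editor content between lists, kept as-is
};

type ReferenceSection = {
  title: string; // heading text
  level: number; // heading level
  content: Array<ReferenceList | FreeText>; // document order is preserved
};

// The wikitext example from earlier in the thread, expressed in this shape.
const exampleSection: ReferenceSection = {
  title: "References",
  level: 1,
  content: [
    { kind: "text", html: "Some references go here" },
    { kind: "list", group: "foo", items: [] },
    { kind: "text", html: "More text" },
    { kind: "list", group: "bar", items: [] },
  ],
};

// Lists the entry kinds in order, showing that free-form text keeps its position.
function contentKinds(section: ReferenceSection): string[] {
  return section.content.map((entry) => entry.kind);
}
```

Because free-form text is just another entry in `content`, a client that only wants the lists can filter on `kind`, while a client that wants the full section can walk the array in order.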

Additionally, he is marking the divs in the content for clients who wish to know where the references were located within the page.

@bearND is going to write up a JSON spec that we can bike shed on.

Oh, @Nirzar also mentioned: since adding the popup reference UI, the number of users that actually view the lists of references is very, very small. So it makes sense from a UX standpoint to remove that from the main content and only show it when users want to see the list.

Structured data can be derived from HTML, so all I'm saying is that right now giving HTML gives the best of both worlds. A move to structured data could actually be detrimental to this use case if we're shipping both (more bytes). Note the references list is essential for a print view (as otherwise none of the references make sense) so we shouldn't get rid of it entirely.

Right now the only benefit I see of structuring reference data is that the client doesn't have to do any internal structuring, at the cost of the client having to receive more data.

What actually would be neat is to do queries such as which pages use reference A - but that seems to be out of scope for this task.

I'm curious how many pages are marked up this way. Would it be worth finding out? Did we ever find out when we were implementing lazily loaded references on the mobile site?

Yes most definitely and I'd hope/expect that to be a precursor to this decision.

Right now the only benefit I see of structuring reference data is that the client doesn't have to do any internal structuring

@Jdlrobson Right, and that is a HUGE benefit. Less code in the clients. Less time processing. Less chance for bugs. Being able to fix a bug on a platform and sharing that with others instantly.
Essentially this sums up the rationale for moving such logic to the server and out of the clients.

Additionally, this format supports the product/designs needs directly, which I think is getting lost here… the APIs we are building are for the Reading clients. So delivering the data in a way that supports these designs is a clear benefit.

…at the cost of the client having to receive more data.
a move to structured data could actually be detrimental to this use case if we're shipping both (more bytes).

I don't think we would ship both - the whole point is being able to use the structured data to build a better interface. So delivering both would seem unnecessary.

Note the references list is essential for a print view (as otherwise none of the references make sense) so we shouldn't get rid of it entirely.

Delivering it in a different format isn't the same as getting rid of it. Currently the MobileView API returns all sections as JSON, and the clients reconstruct that into a page - so in the case of printing or any other needs to display a list, clients could do something similar, right?

Also, PDF printing is a much smaller use case when compared to users who are just viewing a page every day on a mobile device with a small screen.
I do think you are right - we do need to support printing, but it probably shouldn't be the primary driver of what we deliver.

Structured data can be derived from HTML, so all I'm saying is that right now giving HTML gives the best of both worlds,

Designs are moving away from the inline presentation.
Along with that, HTML parsing on clients can be expensive, and requires clients to write the same code multiple times.

So I'm not sure HTML processing on the clients is the best of both worlds. …And HTML could also be constructed from the JSON just as easily.

I'm curious how many pages are marked up this way. Would it be worth finding out? Did we ever find out when we were implementing lazily loaded references on the mobile site?

Yes most definitely and I'd hope/expect that to be a precursor to this decision.

Yes… if what @bearND comes up with is "lossy" then we would absolutely have to do some research.

But if the structured data can be reassembled as the editor intended it, then we don't have to investigate… and honestly looking at the example you posted and reviewing some articles, it really doesn't seem that complicated. Sometimes we have text in the middle of a list - that's pretty easy to represent in an array.

Sounds to me that we should have a page-reference-list library which can go from HTML to JSON and back.
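
As a rough illustration of one direction of such a library (JSON to HTML only), here is a minimal sketch. It assumes a structure in which free-form text and reference lists are entries in one ordered array; the `Entry` type, the `renderSection` name, and the output markup are all made up for this example, and the strings are assumed to be trusted HTML from the server.

```typescript
// Hypothetical entry shape: free-form text and reference lists share one array.
type Entry =
  | { kind: "text"; html: string }
  | { kind: "list"; items: { id: string; html: string }[] };

// Rebuilds a reference section as HTML, e.g. for a print view.
// Assumes title and html fields are already safe, server-provided HTML.
function renderSection(title: string, entries: Entry[]): string {
  const body = entries.map((entry) => {
    if (entry.kind === "text") {
      // Editor content between lists is emitted in place.
      return `<p>${entry.html}</p>`;
    }
    const items = entry.items
      .map((item) => `<li id="${item.id}">${item.html}</li>`)
      .join("");
    return `<ol class="references">${items}</ol>`;
  });
  return `<h2>${title}</h2>${body.join("")}`;
}
```

Going the other way (HTML to JSON) needs a real HTML parser, which is the more expensive half and the part this sketch leaves out.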

The client would still have to include JavaScript functions for either translating HTML to JSON or the other way around (for the print case at least).

Why not strip the references and leave a marker/placeholder on the HTML, and provide the structured JSON, so that the clients can render the references in the provided placeholder markers if they decide to?

That way you ship the content as HTML without losing editor content, without the references, and ship the references just once as JSON, and you have information to render them in the HTML if you decide to.

Does that make sense?
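
A minimal sketch of the client side of that idea, assuming the HTML carries a marker like `<div data-ref-placeholder="ID"></div>` and the JSON carries lists keyed by the same ID. The marker format, the `RefList` type, and the function names are all hypothetical.

```typescript
// Hypothetical JSON shape: each list has an id matching a marker in the HTML.
type RefList = { id: string; items: string[] }; // items are HTML strings

function findList(lists: RefList[], id: string): RefList | undefined {
  for (const list of lists) {
    if (list.id === id) return list;
  }
  return undefined;
}

// Replaces each placeholder marker with the rendered list it points at.
// Markers with no matching list are left untouched.
function stitchReferences(pageHtml: string, lists: RefList[]): string {
  // Assumed marker format: <div data-ref-placeholder="ID"></div>
  const markerPattern = /<div data-ref-placeholder="([^"]+)"><\/div>/g;
  return pageHtml.replace(markerPattern, (marker: string, id: string) => {
    const list = findList(lists, id);
    if (list === undefined) return marker;
    const items = list.items.map((html) => `<li>${html}</li>`).join("");
    return `<ol class="references">${items}</ol>`;
  });
}
```

A client that wants references elsewhere (a popup, a native table) simply never calls `stitchReferences` and reads the JSON directly; the surrounding editor content in the HTML is unaffected either way.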

Why not strip the references and leave a marker/placeholder on the HTML, and provide the structured JSON, so that the clients can render the references in the provided placeholder markers if they decide to?

That way you ship the content as HTML without losing editor content, without the references, and ship the references just once as JSON, and you have information to render them in the HTML if you decide to.

Does that make sense?

Yes.

Twitter does the same thing when parsing out hashtags, mentions, and URLs from tweets. The API docs refer to these objects as entities, and represent them as some content with a known type that occurs between a starting and ending index in a string, e.g. https://dev.twitter.com/overview/api/entities#hashtags.

From the introductory paragraph:

Entities are never divorced from the content they describe.
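
For illustration, the entity idea can be sketched like this: the content string is left intact, and each entity records its type plus the start and end index it occupies. This is a toy extractor with a simplistic `#word` pattern and made-up names, not Twitter's actual parsing rules.

```typescript
// An entity describes a typed span of the content without removing it.
type Entity = {
  type: "hashtag";
  text: string; // the tag without the leading '#'
  indices: [number, number]; // [start, end) offsets into the content string
};

// Toy extractor: finds '#word' spans and reports where they sit in the string.
function extractHashtags(content: string): Entity[] {
  const entities: Entity[] = [];
  const pattern = /#(\w+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(content)) !== null) {
    entities.push({
      type: "hashtag",
      text: match[1],
      indices: [match.index, match.index + match[0].length],
    });
  }
  return entities;
}
```

The parallel for references would be keeping a marker in the page HTML for each reference list while the list contents travel separately as structured data.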

@Jhernandez @phuedx I'm updating docs now, but we have stepped back from restitching content on the client, in favor of providing a separate API with the content still in place, but marked up.

So basically we are designing 2 sets of APIs in the PCS:

  • App client APIs: Deliver HTML and structured content to app clients to build native and JS interfaces
  • HTML client APIs: Deliver marked-up HTML for clients to render full pages of content

So if you want references in line, you got 'em: just hit the HTML client APIs.
If you don't, then you can use the App client APIs.

Why not strip the references and leave a marker/placeholder on the HTML, and provide the structured JSON, so that the clients can render the references in the provided placeholder markers if they decide to?

Entities are never divorced from the content they describe.

This is actually being done. There will STILL be place holders in the App API HTML if clients want to know where the reference lists came from. So this workflow is also possible. But if you really want to inject the references back in, you probably want to use the HTML client API instead.

Does this sound like what you two are looking for?

So if you want references in line, you got 'em: just hit the HTML client APIs. If you don't, then you can use the App client APIs.

Clarification necessary... but the App client APIs will allow you to render the reference HTML right?
(In web we'd want to keep any sections with references collapsed, and then when the heading is clicked obtain the list of references and render the HTML.)

That is what I understood from:

This is actually being done. There will STILL be place holders in the App API HTML if clients want to know where the reference lists came from. So this workflow is also possible.

So if you want references in line, you got 'em: just hit the HTML client APIs. If you don't, then you can use the App client APIs.

Clarification necessary... but the App client APIs will allow you to render the reference HTML right?
(In web we'd want to keep any sections with references collapsed, and then when the heading is clicked obtain the list of references and render the HTML.)

@Jdlrobson I think a few crossed wires here: You really, really, really should not want to just render the Reference List HTML inline here. Nor should you want to display a collapsed Reference list section.

I think as a first step you want to sync up with @Nirzar on design for Marvin. He has changes planned for how references are displayed and they are absolutely different from what MobileFrontend does (which seems to be what you are asking to support).

In other words:

  • MobileFrontend is expected to render references inline with the content and use the HTML APIs (and is why the reference lists are returned inline with the content)
  • Marvin is expected to render references entirely separate from the content and use the App APIs (and is why references are returned as structured data separate from the content)

Hit me back here after you sync to make sure that the API is in line with what you need to do (but from my discussions this will serve his purposes - but I'd like to know that you think the same).

You really, really, really should not want to just render the Reference List HTML inline here.

We should all sync up then before this happens as there's a big flaw in the design and the proposed implementation. This won't work.
This effectively means you'd be throwing out content that editors have created. You really can't rely on references being in only one section. That's not correct.

Consider the "Notes and References" section on https://en.m.wikipedia.org/wiki/Barack_Obama
The Notes subheading contains references that are notes and specifically grouped that way.
The "References" and "Further reading" sections are not accessible from inside the article.
If you remove the references from this article, you'll have an empty "Notes" section. If you remove that section entirely the "Notes and References" section heading is confusing. Hopefully this illustrates the problem - this is just one very simple example.

Please rethink this before coding.
I thought I could do exactly what you are planning to do when we were experimenting with lazy loaded references, but I was wrong.

There are strong reasons why MobileFrontend works the way it does, whether we like it or not.

@Jdlrobson everything you mention is pretty much accounted for. We are specifically using the Barack Obama article as a test and using the problems you mention as guidance for how to proceed.

Please check out the implementation tickets and comment there on specific things you think aren't being handled:
T170690: Extract a References JSON API
T168875: Remove reference sections at end of article from page content HTML