Extract a References JSON API
Closed, Resolved, Public

Description

Create an API that does the following:

  1. Gets ALL reference lists from a page
  2. Returns a structured list of references, complete with the section headers that the references were contained in
  3. If some text appears at the top or bottom of a reference list within a reference section, returns it within the structure for the section

References will be used in 2 contexts:

  1. A popup (as seen now in mobile web and apps)
  2. A list (similar to references sections at the end of an article)

Proposed popup for desktop:

Proposed list for desktop:

Example application of list as a popover:

Example application of 'grouped' citation pop-up:

See proposed data structure outlined in the doc:
https://docs.google.com/presentation/d/19EC_6kOYREwwC9Fieme_CcKQaT-X13Ks4IkGqR6vFFI/edit#slide=id.g2546c2693e_0_99
[pertinent slides are from 6-13]

Details

Related Gerrit Patches:
mediawiki/services/mobileapps : master - References: add section headings
mediawiki/services/mobileapps : master - References: make content an object
mediawiki/services/mobileapps : master - References: add specific version number
mediawiki/services/mobileapps : master - Change reference endpoint to return structured reference sections

Event Timeline

bearND added a subscriber: bearND. Edited Jul 19 2017, 4:47 AM

Here's a strawman proposal for the JSON structure built from the following HTML example:

Source HTML example

<section>
    <h3 id="Notes">Notes</h3>
    <div class="reflist columns references-column-width" style="-moz-column-width: 30em; -webkit-column-width: 30em; column-width: 30em; list-style-type: decimal;">
    <ol class="mw-references" about="#mwt1678">
    
        <li id="cite_note-Merriam-Webster_Dictionary-1">
            <a href="./Barack_Obama#cite_ref-Merriam-Webster_Dictionary_1-0"> <span class="mw-linkback-text">↑ </span></a>
            <span id="mw-reference-text-cite_note-Merriam-Webster_Dictionary-1" class="mw-reference-text">
    
                <ul><li><cite...
    
            </span>
        </li>
    
        <li id="cite_note-Obama_1995.2C_2004.2C_pp._9.E2.80.9310-14">
            <span>
                <a href="./Barack_Obama#cite_ref-Obama_1995.2C_2004.2C_pp._9.E2.80.9310_14-0"><span class="mw-linkback-text">1 </span></a>
                <a href="./Barack_Obama#cite_ref-Obama_1995.2C_2004.2C_pp._9.E2.80.9310_14-1"><span class="mw-linkback-text">2 </span></a>
            </span>
            <span id="mw-reference-text-cite_note-Obama_1995.2C_2004.2C_pp._9.E2.80.9310-14" class="mw-reference-text">
    
                Obama (1995, 2004), pp. 9–10.<ul><li>...
    
            </span>
        </li>
    </ol>
</section>

The resulting JSON

{
  "reference_sections": 
  [
    {
      "anchor": "Notes",
      "line": "Notes",
      "id": 30, // section number
      "toclevel": 2,
      "reference_lists":
      [
        {
          "before": "<div class=\"reflist columns references-column-width\" style=\"-moz-column-width: 30em; -webkit-column-width: 30em; column-width: 30em; list-style-type: decimal;\">",
          "id": "mwt1678",
          "ol":
          [
            {
              "id": "Merriam-Webster_Dictionary-1", // "cite_note-"
              "num_linkbacks": 1,
              "text": "some quoted html text with an unordered list <ul><li><cite..."
            },
            {
              "id": "Obama_1995.2C_2004.2C_pp._9.E2.80.9310-14", // "cite_note-"
              "num_linkbacks": 2,
              "text": "Obama (1995, 2004), pp. 9–10.<ul><li>..."
            }
          ],
          "after": "</div>"
        }
      ]
    }
  ]
}

Notes

  • I kept the description of the section similar to what we've been doing for other sections, and similar to what MW API does.
  • I assume that we can build the backlinks just from the count of how many there are:
    • If there is only one backlink, it gets an ↑ for the textContent.
<a href="./${title}#cite_ref-${refId}-0"> <span class="mw-linkback-text">↑ </span></a>
    • If there is more than one backlink, they are wrapped in a <span> and the textContent for each link is numbered: 1, 2, .... The link targets get the corresponding zero-based index appended.
<span>
    <a href="./${title}#cite_ref-${refId}-${i}"><span class="mw-linkback-text">${i+1} </span></a>
    <a href="./${title}#cite_ref-${refId}-${i}"><span class="mw-linkback-text">${i+1} </span></a>
</span>

@bearND awesome, thanks! This is basically looking like I expected. Here are some questions / feedback:

"anchor": "Notes",
"line": "Notes",

What's the difference between anchor and line?

"id": 30, // section number

If you need a comment, maybe it's best to just name it "section_number" for clarity

"reference_sections":

Since we may have references that aren't in "reference sections", you may want to just call this "references".
For instance, references that are in info boxes may end up here. They won't be in a reference section, but they are references. (You would still provide the section information so the clients know where they came from)

"reference_lists":

This also may be better to name more generally - I would go with something like "content".
For instance, we may have a child reference section, so you may need to represent the section info again (anchor, line, section_id). Also, we may have free-form text in between the content, and you may need to represent that in this array as well.

"before": "<div class=\"reflist columns references-column-width\" style=\"-moz-column-width: 30em; -webkit-column-width: 30em; column-width: 30em; list-style-type: decimal;\">",
"after": "</div>"

I think these can be omitted, right? In the same way we return News Items in the feed, I would think we would leave it to the clients to use the structured information (section id / toc level) to properly indent and structure the references list? Or to put it another way, I don't think we need to provide this HTML if we structure the content very clearly. (Note: we would still provide the inline styles (bold, italics, in the text properties) / links html)

"ol":

Is this here to represent that this list is ordered?

A few other questions:

  1. I am not sure I understand the link backs - are they still embedded somewhere? How does a client get them?
  2. I think maybe I am missing some context with the "…", maybe it would be best to include whatever text is being represented to make the example more clear?
  3. Do we need to represent the "type" of reference in any way? Nirzar brought this up and not sure if it affects you: T164493: Highlighting shortened footnotes

"anchor": "Notes",
"line": "Notes",

What's the difference between anchor and line?

Line is the visible text on the page for the section heading; anchor is the link target used to jump to the beginning of that section. Nowadays the two differ when special characters are used in the section heading, since those are anchor-encoded. This is until T152540 is implemented.
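
A rough sketch of how an anchor can differ from its line. It only approximates the legacy anchor encoding (spaces become underscores, other special characters are UTF-8 percent-encoded, and "%" is written as "."); the function name legacyAnchor is hypothetical, and the real encoder in MediaWiki may treat some characters differently.

```javascript
// Approximation of legacy anchor encoding; NOT MediaWiki's actual code.
function legacyAnchor(line) {
  return encodeURIComponent(line.replace(/ /g, '_'))
    // encodeURIComponent leaves these characters alone; encode them too.
    .replace(/[!'()*~]/g, (c) => '%' + c.charCodeAt(0).toString(16).toUpperCase())
    // Legacy anchors use "." where URLs use "%".
    .replace(/%/g, '.');
}

console.log(legacyAnchor('Notes'));              // same as the line
console.log(legacyAnchor('Notes & references')); // differs from the line
```

Under these assumptions, a heading without special characters yields an anchor identical to its line, which is why both are "Notes" in the example.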

"id": 30, // section number

If you need a comment, maybe it's best to just name it "section_number" for clarity

The "id" property is used in the current toc ("sections") schema. I was thinking we could make this one and the others for the toc match the HTML structure a bit more but that would then cause pain for all the current clients (both apps and web) since they would have to change to different terms.

Compare

{
  "anchor": "Notes",
  "line": "Notes",
  "id": 30, // section number
  "toclevel": 2
}

with

{
  "id": "Notes", // <h3 id="Notes">
  "line": "Notes",
  "section_num": 30,
  "level": 3 // <h3
}

"reference_sections":

Since we may have references that aren't in "reference sections", you may want to just call this "references".
For instance, references that are in info boxes may end up here. They won't be in a reference section, but they are references. (You would still provide the section information so the clients know where they came from)

I must have misunderstood that then. I thought you said yesterday that we don't remove reference lists if they are in infoboxes.

"reference_lists":

This also may be better to name more generally - I would go with something like "content".
For instance, we may have a child reference section, so you may need to represent the section info again (anchor, line, section_id). Also, we may have free-form text in between the content, and you may need to represent that in this array as well.

Good point. I've changed the structure quite a bit, see at the end of this comment. The new structure is more flexible. So we could also deal with two consecutive section headings, like

== Notes & references ==
=== Notes ===

I believe in this case you'd want both sections to be removed. Coding the transformation would be a bit more complicated but with this structure it's at least possible to do so.

"before": "<div class=\"reflist columns references-column-width\" style=\"-moz-column-width: 30em; -webkit-column-width: 30em; column-width: 30em; list-style-type: decimal;\">",
"after": "</div>"

I think these can be omitted, right? In the same way we return News Items in the feed, I would think we would leave it to the clients to use the structured information (section id / toc level) to properly indent and structure the references list? Or to put it another way, I don't think we need to provide this HTML if we structure the content very clearly. (Note: we would still provide the inline styles (bold, italics, in the text properties) / links html)

Then we would lose some information about how the reference list should be displayed. In this example (taken from the Barack Obama article) it shows the references in columns. In other articles, which usually have fewer references in the list, this is not the case. I guess we could do that, depending on the UI design of how references should be actually displayed. With the new structure I'm proposing this is more flexible, and we can easily skip these if we want.

"ol":

Is this here to represent that this list is ordered?

Yes. Alternatively, in the new proposal I just called it list.

A few other questions:

  1. I am not sure I understand the link backs - are they still embedded somewhere? How does a client get them?

To expand the JSON structure back to HTML all we need to know is how many link backs there are. I described the general structure at the end of my previous comment.

  1. I think maybe I am missing some context with the "…", maybe it would be best to include whatever text is being represented to make the example more clear?

This represents a good chunk of HTML which would have to be quoted, basically everything until the <span> ends.

  1. Do we need to represent the "type" of reference in any way? Nirzar brought this up and not sure if it affects you: T164493: Highlighting shortened footnotes

In the resulting Parsoid HTML there doesn't seem to be a difference between a reference list including harv refs or other refs. At least, I haven't found it yet.

Anyways, here's the updated proposal:

{
  "items":
  [
    {
      "type": "section",
      "id": "Notes",
      "line": "Notes",
      "section_num": 30,
      "level": 3
    },
    {
      "type": "text",
      "text": "<div class=\"reflist columns references-column-width\" style=\"-moz-column-width: 30em; -webkit-column-width: 30em; column-width: 30em; list-style-type: decimal;\">"
    },
    {
      "type": "references",
      "id": "mwt1678",
      "list": [{
        "id": "Merriam-Webster_Dictionary-1", // "cite_note-"
        "num_linkbacks": 1,
        "text": "<ul><li><cite..."
      }, {
        "id": "Obama_1995.2C_2004.2C_pp._9.E2.80.9310-14", // "cite_note-"
        "num_linkbacks": 2,
        "text": "Obama (1995, 2004), pp. 9–10.<ul><li>..."
      }]
    },
    {
      "type": "text",
      "text": "</div>"
    }
  ]
}
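
A minimal sketch of how a client might consume this flat "items" array without using the wrapper markup: it attaches each reference list to the most recent section item and skips "text" items. The function name groupBySection and the shape of the returned groups are assumptions for illustration, not part of the proposal.

```javascript
// Group references under the section heading that precedes them.
function groupBySection(items) {
  const groups = [];
  let current = null;
  for (const item of items) {
    if (item.type === 'section') {
      current = { section: item, references: [] };
      groups.push(current);
    } else if (item.type === 'references') {
      if (!current) {
        // Reference list that is not preceded by any section heading.
        current = { section: null, references: [] };
        groups.push(current);
      }
      current.references.push(...item.list);
    }
    // "text" items (e.g. the wrapping <div>) are ignored in this sketch.
  }
  return groups;
}
```

With two consecutive headings such as "Notes & references" followed by "Notes", the first group simply ends up with an empty references array, which the client can hide or render as a parent heading.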

"anchor": "Notes",
"line": "Notes",

What's the difference between anchor and line?

Line is the visible text on the page for the section heading; anchor is the link target used to jump to the beginning of that section. Nowadays the two differ when special characters are used in the section heading, since those are anchor-encoded. This is until T152540 is implemented.

Gotcha… If T152540 is approved, can we switch our sectioning logic to use the HTML 5 sections?

"id": 30, // section number

If you need a comment, maybe it's best to just name it "section_number" for clarity

The "id" property is used in the current toc ("sections") schema. I was thinking we could make this one and the others for the toc match the HTML structure a bit more but that would then cause pain for all the current clients (both apps and web) since they would have to change to different terms.

This is fine and expected… these PCS APIs WILL be breaking - we can take this opportunity to rename/group/change properties to make them more clear.

Compare

{
  "anchor": "Notes",
  "line": "Notes",
  "id": 30, // section number
  "toclevel": 2
}

with

{
  "id": "Notes", // <h3 id="Notes">
  "line": "Notes",
  "section_num": 30,
  "level": 3 // <h3
}

Looks better. We may want to make level more specific, like "heading_level" or "section_level".

Or we could put all this data in a specific section dictionary:

{
  "section": {
    "id": "Notes", // <h3 id="Notes">
    "line": "Notes",
    "number": 30,
    "level": 3 // <h3
  }
}

"reference_sections":

Since we may have references that aren't in "reference sections", you may want to just call this "references".
For instance, references that are in info boxes may end up here. They won't be in a reference section, but they are references. (You would still provide the section information so the clients know where they came from)

I must have misunderstood that then. I thought you said yesterday that we don't remove reference lists if they are in infoboxes.

Sorry, I may not have communicated this well:

Just because we don't remove it from the page, doesn't mean it isn't also within the structured reference API.

Think of it this way:
We have an API that returns all references for an article. For an API consumer, they don't care where they came from - they want to know all the references.
We also return all the article content in a separate API. Whether or not we feel that a reference list is part of the article (in an info box) or not (at the bottom of the article) is orthogonal to whether we return it in the references API. In this way, a reference list may not get removed from the article content, but is still returned in the References API. Does this make sense?

"reference_lists":

This also may be better to name more generally - I would go with something like "content".
For instance, we may have a child reference section, so you may need to represent the section info again (anchor, line, section_id). Also, we may have free-form text in between the content, and you may need to represent that in this array as well.

Good point. I've changed the structure quite a bit, see at the end of this comment. The new structure is more flexible. So we could also deal with two consecutive section headings, like

== Notes & references ==
=== Notes ===

I believe in this case you'd want both sections to be removed. Coding the transformation would be a bit more complicated but with this structure it's at least possible to do so.

"before": "<div class=\"reflist columns references-column-width\" style=\"-moz-column-width: 30em; -webkit-column-width: 30em; column-width: 30em; list-style-type: decimal;\">",
"after": "</div>"

I think these can be omitted, right? In the same way we return News Items in the feed, I would think we would leave it to the clients to use the structured information (section id / toc level) to properly indent and structure the references list? Or to put it another way, I don't think we need to provide this HTML if we structure the content very clearly. (Note: we would still provide the inline styles (bold, italics, in the text properties) / links html)

Then we would lose some information about how the reference list should be displayed. In this example (taken from the Barack Obama article) it shows the references in columns. In other articles, which usually have fewer references in the list, this is not the case. I guess we could do that, depending on the UI design of how references should be actually displayed. With the new structure I'm proposing this is more flexible, and we can easily skip these if we want.

True, but think of how we return the lead image, title and wikidata description of an article as structured data. We let clients put those together any way they wish. This is much the same. We already display references in a popover in a totally different context. This API is intended to be used in much the same way.

"ol":

Is this here to represent that this list is ordered?

Yes. Alternatively, in the new proposal I just called it list.

A few other questions:

  1. I am not sure I understand the link backs - are they still embedded somewhere? How does a client get them?

To expand the JSON structure back to HTML all we need to know is how many link backs there are. I described the general structure at the end of my previous comment.

Oh really? How/why does this work? Why do you need to know this number at all?

  1. I think maybe I am missing some context with the "…", maybe it would be best to include whatever text is being represented to make the example more clear?

This represents a good chunk of HTML which would have to be quoted, basically everything until the <span> ends.

  1. Do we need to represent the "type" of reference in any way? Nirzar brought this up and not sure if it affects you: T164493: Highlighting shortened footnotes

In the resulting Parsoid HTML there doesn't seem to be a difference between a reference list including harv refs or other refs. At least, I haven't found it yet.

@Fjalapeno

If T152540 is approved, can we switch our sectioning logic to use the HTML 5 sections?

No. I think there is a misunderstanding about what T152540 covers. The title of the task is misleading IMHO: it should talk about anchors instead of ids. As far as I understand T152540, it's only about changing whether any characters in the anchors are encoded and how.

Just because we don't remove it from the page, doesn't mean it isn't also within the structured reference API.

Ah, I see. This means that we would duplicate some information between the article and the references endpoints in those cases where we don't remove the references.

True, but think of how we return the lead image, title and wikidata description of an article as structured data. We let clients put those together any way they wish. This is much the same. We already display references in a popover in a totally different context. This API is intended to be used in much the same way.

Ok, that's fine. We can easily skip that for now. If we need it later, the structure would allow it to be added.

re: link backs
Oh really? How/why does this work? Why do you need to know this number at all?

From what I see the structure of the backlinks is always the same, so we can build the backlinks just from the count of how many there are:

  • CASE 1: If there is only one backlink, it gets an ↑ for the textContent.
<a href="./${title}#cite_ref-${refId}-0"> <span class="mw-linkback-text">↑ </span></a>

Example:

<a href="./Barack_Obama#cite_ref-foo-0"> <span class="mw-linkback-text">↑ </span></a>
  • CASE 2: If there is more than one backlink, they are wrapped in a <span> and the textContent for each link is numbered: 1, 2, .... The link targets get the corresponding zero-based index appended.
<span>
    <a href="./${title}#cite_ref-${refId}-${i}"><span class="mw-linkback-text">${i+1} </span></a>
    <a href="./${title}#cite_ref-${refId}-${i}"><span class="mw-linkback-text">${i+1} </span></a>
</span>

Example:

<span>
    <a href="./Barack_Obama#cite_ref-foo-0"><span class="mw-linkback-text">1 </span></a>
    <a href="./Barack_Obama#cite_ref-foo-1"><span class="mw-linkback-text">2 </span></a>
</span>

So, it's very predictable. In the above examples I replaced the longer strings with foo to make the important parts pop out more. The actual reference in the article has to use the same anchor next to the original link but that's ok since we preserve that in the article. The backlinks being so predictable allows us to reduce the size of the reference payload.
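
The two cases above can be sketched as a small function. This is only an illustration under the assumptions stated in this comment; the name buildBacklinks is hypothetical. Note that refId must be the form used in the cite_ref- anchors, which in the Barack Obama sample is not identical to the cite_note- id (Merriam-Webster_Dictionary_1 versus Merriam-Webster_Dictionary-1).

```javascript
// Rebuild the backlink markup for one reference from its backlink count.
function buildBacklinks(title, refId, count) {
  if (count <= 0) {
    return '';
  }
  const base = `./${title}#cite_ref-${refId}`;
  if (count === 1) {
    // CASE 1: a single backlink shows an up arrow.
    return `<a href="${base}-0"> <span class="mw-linkback-text">↑ </span></a>`;
  }
  // CASE 2: several backlinks, zero-based targets, one-based labels.
  const links = [];
  for (let i = 0; i < count; i++) {
    links.push(`<a href="${base}-${i}"><span class="mw-linkback-text">${i + 1} </span></a>`);
  }
  return `<span>${links.join('')}</span>`;
}

console.log(buildBacklinks('Barack_Obama', 'foo', 2));
```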

No. I think there is a misunderstanding about what T152540 covers. The title of the task is misleading IMHO: it should talk about anchors instead of ids. As far as I understand T152540, it's only about changing whether any characters in the anchors are encoded and how.

Ok that doesn't seem terribly important for us anyways

Ah, I see. This means that we would duplicate some information between the article and the references endpoints in those cases where we don't remove the references.

Yes, this can lead to some "minor" duplication. But honestly (and I'm sure you can confirm) the vast majority of reference lists are at the bottom of an article, so I'm not worried about this too much. I think preserving the content and getting everything in the structure outweighs the added bytes, especially in light of your research showing that removing reference lists doesn't really reduce content size that much anyway.
We can run your script with this algorithm to confirm as well.

So, it's very predictable. In the above examples I replaced the longer strings with foo to make the important parts pop out more. The actual reference in the article has to use the same anchor next to the original link but that's ok since we preserve that in the article. The backlinks being so predictable allows us to reduce the size of the reference payload.

Ok, I think I understand. Since they are of a known form, basically we are using a counter and incrementing to get the backlinks?

"minor" duplication

Ok, that makes sense to me. You're right, the duplication is probably not a big deal for the reference lists in infoboxes or other earlier places in the article.
How do we distinguish in code between the two cases? Only by where the section is relative to the end of the article? Or is there anything else we could go by?

we are using a counter and incrementing to get the backlinks?

Yes.

"minor" duplication

Ok, that makes sense to me. You're right, the duplication is probably not a big deal for the reference lists in infoboxes or other earlier places in the article.
How do we distinguish in code between the two cases? Only by where the section is relative to the end of the article? Or is there anything else we could go by?

Basically we just have the relative position of the section for now. If you come up with any improvements to that heuristic, feel free to update it

we are using a counter and incrementing to get the backlinks?

Yes.

Ok this makes sense now, thanks!

Knowing this makes me think about how we have been making efforts to reduce the string manipulation on clients (like URL creation and titles) - If we were to include the actual backlinks instead of requiring the clients to reconstruct them how many extra bytes are we actually including?

Before gzip it's around 100 bytes per backlink. Assuming that most references only have one backlink, on the Barack_Obama page with almost 500 references it would be roughly 50KB before gzip theoretically. With gzip it's actually only 3KB.
The reconstruction would be part of the library we need for the roundtrip HTML -> JSON -> HTML. Either way the client would get HTML text blocks which it would insert into the DOM using that library.
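
A rough way to reproduce this kind of estimate with Node's built-in zlib module. The sketch generates single-backlink markup for a given number of references and compares raw and gzipped sizes. The function name backlinkPayloadSize and the synthetic ref_N ids are assumptions; real reference names are less regular, so the numbers will not match the figures quoted above.

```javascript
const zlib = require('zlib');

// Estimate how many bytes explicit backlinks add to a JSON payload.
function backlinkPayloadSize(title, refCount) {
  const links = [];
  for (let i = 0; i < refCount; i++) {
    // Synthetic ids; real cite_ref ids vary much more.
    links.push(`<a href="./${title}#cite_ref-ref_${i}-0"> <span class="mw-linkback-text">↑ </span></a>`);
  }
  const json = JSON.stringify(links);
  return {
    raw: Buffer.byteLength(json, 'utf8'),
    gzipped: zlib.gzipSync(Buffer.from(json, 'utf8')).length,
  };
}

console.log(backlinkPayloadSize('Barack_Obama', 500));
```

Because the backlink markup is so repetitive, the gzipped size is a small fraction of the raw size, which is the effect described in the comment above.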

Before gzip it's around 100 bytes per backlink. Assuming that most references only have one backlink, on the Barack_Obama page with almost 500 references it would be roughly 50KB before gzip theoretically. With gzip it's actually only 3KB.

3KB sounds pretty small - seems like this could be an optimization that we don't need? Or at least the savings don't seem to be that much. Especially since this API is going to be used after the main content is loaded so it won't block the render.

The reconstruction would be part of the library we need for the roundtrip HTML -> JSON -> HTML. Either way the client would get HTML text blocks which it would insert into the DOM using that library.

I'm not sure what you mean here. What reconstruction? What are clients inserting into the DOM?

What reconstruction? What are clients inserting into the DOM?

Well, I was thinking about if the reflists have to be reconstructed in the article view. That's at least for the print case in mobile web. Then we also need a conversion of the JSON structure to HTML. I'm basically talking about a library for handling this whole round trip, which would power the Reference API on the server side (as mentioned in the diagram in T170581) for HTML to JSON and the opposite portion is used in clients (JSON to HTML).

After chatting with @Fjalapeno I agree that the return trip (JSON to HTML) is not necessary since most clients would try to display the reflist content in (native or web) components. There is the option of structuring the reference content further but that is not needed at this time.
So, a separate library for reflist handling is not needed. It can be done directly inside of MCS/PCS.
We don't want to rely on the backlink counter alone, though, since that would increase the burden on clients to piece the backlinks back together, and the data savings are not significant enough to warrant this complexity.

Here's the updated proposal:

{
  "items":
  [
    {
      "type": "section",
      "id": "Notes",
      "line": "Notes",
      "section_num": 30,
      "level": 3
    },
    {
      "type": "text",
      "text": "<div class=\"reflist columns references-column-width\" style=\"-moz-column-width: 30em; -webkit-column-width: 30em; column-width: 30em; list-style-type: decimal;\">"
    },
    {
      "type": "references",
      "id": "mwt1678",
      "list": [{
        "id": "Merriam-Webster_Dictionary-1", // "cite_note-"
        "linkbacks": ["<a href=\"./Barack_Obama#cite_ref-Merriam-Webster_Dictionary_1-0\"> <span class=\"mw-linkback-text\">↑ </span></a>"],
        "text": "<ul><li><cite..."
      }, {
        "id": "Obama_1995.2C_2004.2C_pp._9.E2.80.9310-14", // "cite_note-"
        "linkbacks": [
          "<a href=\"./Barack_Obama#cite_ref-Obama_1995.2C_2004.2C_pp._9.E2.80.9310_14-0\"><span class=\"mw-linkback-text\">1 </span></a>",
          "<a href=\"./Barack_Obama#cite_ref-Obama_1995.2C_2004.2C_pp._9.E2.80.9310_14-1\"><span class=\"mw-linkback-text\">2 </span></a>"
        ],
        "text": "Obama (1995, 2004), pp. 9–10.<ul><li>..."
      }]
    },
    {
      "type": "text",
      "text": "</div>"
    }
  ]
}

If there is more than one linkback then they may need to be wrapped in a <span> element.

@bearND thanks… I am reviewing this with @Nirzar, and he is also putting together some mocks based on this

@bearND are you able to get the type of reference?

Like book, website, journal, etc?

Fjalapeno updated the task description. Jul 21 2017, 8:55 PM

@Fjalapeno Yes, we can.

<cite class="citation news"...
<cite class="citation book"...
<cite class="citation web"...
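
A minimal sketch of reading the citation type from these class names. It uses a regular expression rather than a DOM parser to stay dependency-free, so it is only suitable for well-formed markup like the samples above; the function name citationTypes is hypothetical.

```javascript
// Return the non-"citation" class names of every <cite class="citation ..."> tag.
function citationTypes(html) {
  const types = [];
  const re = /<cite\b[^>]*\bclass="([^"]*)"/g;
  let match;
  while ((match = re.exec(html)) !== null) {
    const classes = match[1].split(/\s+/).filter(Boolean);
    if (classes.includes('citation')) {
      types.push(...classes.filter((c) => c !== 'citation'));
    }
  }
  return types;
}

console.log(citationTypes('<cite class="citation book">A book</cite>'));
```

A reference whose text has no cite tag yields an empty array, so clients would need a fallback for untyped references.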

@bearND! awesome, can you add that as well?

The citation level can be below the previously structured data, so I'm going to add structure on the lower level as well. I hope this is not getting too complicated for clients. Note that in the Barack Obama example the first reference has an unordered list with three list items. The first list item in there has a cite tag, followed by some other HTML content.
I also changed text -> html for the type and the property name.

{
  "items":
  [
    {
      "type": "section",
      "id": "Notes",
      "line": "Notes",
      "section_num": 30,
      "level": 3
    },
    {
      "type": "html",
      "html": "<div class=\"reflist columns references-column-width\" style=\"-moz-column-width: 30em; -webkit-column-width: 30em; column-width: 30em; list-style-type: decimal;\">"
    },
    {
      "type": "references",
      "id": "mwt1678",
      "list": [{
        "id": "Merriam-Webster_Dictionary-1", // "cite_note-"
        "linkbacks": ["<a href=\"./Barack_Obama#cite_ref-Merriam-Webster_Dictionary_1-0\"> <span class=\"mw-linkback-text\">↑ </span></a>"],
        "content": [{
            "type": "ul",
            "list_items": [{
              "type": "html cite web",
              "html": "<cite class=\"citation web\"><a rel=\"mw:ExtLink\" href=\"https://www.merriam-webster.com/dictionary/Barak\" id=\"mwBOg\">\"Barak\"</a>. <i id=\"mwBOk\"><a rel=\"mw:WikiLink\" href=\"./Merriam-Webster\" title=\"Merriam-Webster\" id=\"mwBOo\">Merriam-Webster Dictionary</a></i>.</cite><span title=\"ctx_ver=Z39.88-2004&amp;rfr_id=info%3Asid%2Fen.wikipedia.org%3ABarack+Obama&amp;rft.atitle=Hussein&amp;rft.genre=unknown&amp;rft.jtitle=Merriam-Webster+Dictionary&amp;rft_id=https%3A%2F%2Fwww.merriam-webster.com%2Fdictionary%2FHussein&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal\" class=\"Z3988\">"
            }, {
              "type": "html cite web",
              "html": "<cite class=\"citation web\"><a rel=\"mw:ExtLink\" href=\"https://www.merriam-webster.com/dictionary/Hussein\" id=\"mwBPA\">\"Hussein\"</a>. <i id=\"mwBPE\"><a rel=\"mw:WikiLink\" href=\"./Merriam-Webster\" title=\"Merriam-Webster\" id=\"mwBPI\">Merriam-Webster Dictionary</a></i>.</cite>"
            }]
        }]
      }, {
        "id": "Obama_1995.2C_2004.2C_pp._9.E2.80.9310-13", // "cite_note-"
        "linkbacks": [
          "<a href=\"./Barack_Obama#cite_ref-Obama_1995.2C_2004.2C_pp._9.E2.80.9310_13-0\"><span class=\"mw-linkback-text\">1 </span></a>",
          "<a href=\"./Barack_Obama#cite_ref-Obama_1995.2C_2004.2C_pp._9.E2.80.9310_13-1\"><span class=\"mw-linkback-text\">2 </span></a>"
        ],
        "content": [{
          "type": "ul",
          "list_items": [{
            "type": "html",
            "html": "Scott (2011), pp. 80–86."
          }, {
            "type": "html",
            "html": "Jacobs (2011), pp. 115–118."
          }, {
            "type": "html",
            "html": "Maraniss (2012), pp. 154–160."
          }]
        }]
      }]
    },
    {
      "type": "html",
      "html": "</div>"
    }
  ]
}
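
For the popup use case, a client needs to look up one reference by id and pull out its HTML fragments. A minimal sketch against the structure above, assuming each content block is either a "ul" block with list_items or a leaf carrying an html property; the function names findReference and referenceFragments are hypothetical.

```javascript
// Find a reference by id across all "references" items.
function findReference(items, id) {
  for (const item of items) {
    if (item.type !== 'references') continue;
    const ref = item.list.find((r) => r.id === id);
    if (ref) return ref;
  }
  return null;
}

// Collect the HTML fragments of one reference, in document order.
function referenceFragments(ref) {
  return ref.content
    .flatMap((block) => (block.list_items ? block.list_items : [block]))
    .map((part) => part.html);
}
```

A popup could then render the fragments as it sees fit, for example one line per fragment, without depending on the original list markup.
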
Fjalapeno added subscribers: DarTar, Halfak. Edited Jul 25 2017, 5:34 PM

Just a note: I met with @DarTar and @Halfak to discuss the structure and purpose of the API, specifically with regard to WikiCite and to the overall effort to understand citations within Wikipedia content.

Currently it appears the projects are in alignment with each other - the contexts and use cases may be different, but they appear compatible technically. When the larger projects like WikiCite are further along we will evaluate how to integrate this API with those to provide even richer citation information to Reading clients.

Another option I mentioned briefly in today's sync meeting is to inline reference content at the definition. Named references could still include the full content at each site referencing the reference.

Pros:

  • More flexible reference display: No need to strip, or re-format large references sections.
  • Shorter time to interactive references: Inline reference display (popup) is easier to implement using node-local information only, and can work as soon as content is visible.
  • Self-contained sections: Section-wise content consumption and editing are simplified.
  • Simpler editing: Simple editors don't need to implement complex logic for updating reference lists, or for disconnecting named reference groups. Instead, this logic can be implemented once, in Parsoid.
  • Space savings: No need to add about and href attributes to connect references with their content. There might also be better compression from having locality between citation & related content.

Cons:

  • Classic reference lists are still needed for some use cases, and would need to be added in a post-processing pass. While this can be done without building a DOM, it might still take 10+ ms on pages with many references.
  • Inline reference content might very slightly delay first paint, as it is loaded right away as part of the HTML stream.
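
For illustration only, the post-processing pass mentioned in the first con could look roughly like the sketch below. It assumes the inline references have already been pulled out of the HTML stream as (name, content) pairs in document order; `build_reference_list` and its field names are hypothetical and are not part of Parsoid or mobileapps.

```python
def build_reference_list(inline_refs):
    """Build a classic, numbered reference list from inline references.

    inline_refs: iterable of (name, content_html) tuples in document order.
    name is None for unnamed references (assumption: named references repeat
    their full content at every use, as proposed above).
    """
    entries = []
    index_by_name = {}
    for name, content in inline_refs:
        if name is not None and name in index_by_name:
            # Repeated named reference: count another use, do not add a new entry.
            entries[index_by_name[name]]["uses"] += 1
            continue
        if name is not None:
            index_by_name[name] = len(entries)
        entries.append({
            "number": len(entries) + 1,
            "name": name,
            "content": content,
            "uses": 1,
        })
    return entries
```

This works on a flat sequence and needs no DOM, which matches the "without building a DOM" remark. The cost of extracting the pairs from the HTML stream is not covered here.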

It depends a bit on what design is coming up for reference lists. If there is a design to display a list of references natively we might want to still have a structured reference list endpoint. (Well, alternatively we can provide a JS library that would scan the HTML and extract the structured reference list.)
Either way, I think we would have to fill in the reference details and inline it where each reference is used.
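
A rough sketch of the "scan the HTML and extract the structured reference list" idea, written in Python rather than JS purely for illustration. It assumes Parsoid-style markup like the source HTML example above, where each definition is an `<li id="cite_note-…">` and each linkback is an `<a href="…#cite_ref-…">`. The `ReferenceScanner` class and `extract_references` function are hypothetical names, not an existing library.

```python
from html.parser import HTMLParser

NOTE_PREFIX = "cite_note-"


class ReferenceScanner(HTMLParser):
    """Collects reference ids and linkback targets while streaming the HTML."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        element_id = attrs.get("id") or ""
        href = attrs.get("href") or ""
        if tag == "li" and element_id.startswith(NOTE_PREFIX):
            # A new reference definition starts here; drop the common prefix.
            self.references.append(
                {"id": element_id[len(NOTE_PREFIX):], "linkbacks": []}
            )
        elif tag == "a" and "#cite_ref-" in href and self.references:
            # Linkbacks point from the definition back to the citing location.
            self.references[-1]["linkbacks"].append(href)


def extract_references(html):
    scanner = ReferenceScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.references
```

The sketch only records ids and linkback targets. Capturing the reference content itself would need additional state and is left out.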

@ssastry, do you have any thoughts on these ideas regarding references? @GWicke mentioned that you may have better control within Parsoid, rather than us stripping them out in the PCS.

It depends a bit on what design is coming up for reference lists. If there is a design to display a list of references natively we might want to still have a structured reference list endpoint.

Currently there is, and its primary function is to preserve the editors' intention of having a list of references available for the user.
Design is really more interested in the popup use case, but this is a case where we don't want to break editors' expectations.

The description doesn't mention anything about revisions and probably should. Identifiers and references can change between revisions, so it's important.
This is why we pass a revision ID, not a title, to https://en.m.wikipedia.org/wiki/Special:MobileCite/790359434 - it would be wise to make sure this is mentioned in the spec/description.

Otherwise you'll have incorrect lists of references for any given page or, even worse, show the wrong reference to a user.

This is true of all endpoints: if we use multiple endpoints for an article, we have to ensure they are from the same revision. This is being planned for, but is not addressed here because it is a general concern and not specific to this endpoint.
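
As a minimal sketch of that check on the client side, assuming each endpoint response is a dict carrying a `revision` field (the field name and the `common_revision` helper are assumptions for illustration, not a confirmed part of the API):

```python
def common_revision(responses):
    """Return the revision shared by all responses, or raise if they differ.

    responses: dict mapping an endpoint label to its decoded JSON response.
    Revisions are compared as strings so that 790359434 and "790359434" match.
    """
    revisions = {str(response["revision"]) for response in responses.values()}
    if len(revisions) != 1:
        raise ValueError(f"Expected exactly one revision, got: {sorted(revisions)}")
    return revisions.pop()
```

A client would call this after fetching, say, the references and another endpoint for the same article, and refetch when it raises.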

MCs had no revision support until a year ago. Just saying. Don't take it for granted. It's easy to ignore and forget.

Change 370296 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] WIP: Change reference endpoint to return structured reference sections

https://gerrit.wikimedia.org/r/370296

phuedx added a subscriber: phuedx. Aug 8 2017, 2:52 PM
RHo updated the task description. Sep 21 2017, 8:26 AM
RHo updated the task description. Sep 21 2017, 3:14 PM
RHo updated the task description. Sep 21 2017, 4:49 PM
ovasileva added a subscriber: ovasileva.
cmadeo added a subscriber: cmadeo. Dec 7 2017, 11:13 PM

Change 370296 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Change reference endpoint to return structured reference sections

https://gerrit.wikimedia.org/r/370296

Change 397640 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] References: add specific version number

https://gerrit.wikimedia.org/r/397640

Heads up that last week we published this dataset parsed from the entire history of English Wikipedia: https://doi.org/10.6084/m9.figshare.5588842

Change 397640 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] References: add specific version number

https://gerrit.wikimedia.org/r/397640

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2018-01-04T18:52:14Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@8bcffa9]: Update mobileapps to a4ba9fd (T182330 T177430 T170690 T182652 T184198)

Mentioned in SAL (#wikimedia-operations) [2018-01-04T18:58:14Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@8bcffa9]: Update mobileapps to a4ba9fd (T182330 T177430 T170690 T182652 T184198) (duration: 06m 01s)

He7d3r added a subscriber: He7d3r. Jan 6 2018, 1:20 PM

This is what we currently have for Citations:
(Cite elements usually have some kind of type indicator in the class list, like "citation web" or "citation book".)
We show only one single value in the type field. If there is a single cite tag anywhere in the reference content or there are multiple cite tags with the same value we show that value. If there are none or multiple cite tags with different values we show "generic".
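
To make the rule concrete, here is a small sketch of it. It is not the actual mobileapps code; `reference_type` is a hypothetical helper, and it assumes the type is the class that follows "citation" in the cite element's class list (e.g. "citation web" gives "web").

```python
import re

# Captures the class attribute of each <cite> element in an HTML fragment.
CITE_CLASS_RE = re.compile(r'<cite\b[^>]*\bclass="([^"]*)"', re.IGNORECASE)


def reference_type(reference_html, fallback="generic"):
    """Return the single citation type found in the reference, else the fallback."""
    types = set()
    for class_attr in CITE_CLASS_RE.findall(reference_html):
        classes = class_attr.split()
        if "citation" in classes:
            # Assumption: the first class other than "citation" is the type.
            others = [c for c in classes if c != "citation"]
            if others:
                types.add(others[0])
    # Exactly one distinct value: show it. None or several: use the fallback.
    return types.pop() if len(types) == 1 else fallback
```

The fallback is a parameter so that either "generic" or "standard" can be passed in.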

So, we use "generic" here but in the summary we use "standard"[1] for the fallback value.
Question: Should we be more consistent? If so, which one do we prefer?

[1] https://gerrit.wikimedia.org/r/#/c/403553/1/lib/mobile-util.js

A note on the use of the word "generic" within the context of Page Previews client:

The generic preview is shown when either a preview can't be generated (e.g. the no content case) or the server returns an error (see T183151: Change copy on empty preview for the most current design and rationale). This is certainly a non-standard preview.

If the server were to return "generic" for the default case (the case where a preview can be generated), this would confuse the language on the client, so I'd prefer we stick with "standard".

Change 416382 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] References: make content an object

https://gerrit.wikimedia.org/r/416382

Change 416382 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] References: make content an object

https://gerrit.wikimedia.org/r/416382

Change 419343 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] References: add section headings

https://gerrit.wikimedia.org/r/419343

Change 419343 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] References: add section headings

https://gerrit.wikimedia.org/r/419343

Hey @bearND, this seems like an Epic. Do you want me to move it back to the backlog and wait for the subtasks to resolve or should we resolve it already?

bearND closed this task as Resolved. Jun 29 2018, 4:08 PM

Hi @Jhernandez. Yes, this is kind of an epic. And we can resolve it since it's done.