
WIP RFC: Reference API requirements and options
Closed, Invalid (Public)

Description

Problem description

The implementation of citations is currently closely tied to skins, and is not well optimized for working with citations as structured data. New use cases require more structured access to a) all revision content, and b) individual references.

Use cases

Mobile web

The mobile web team would like to speed up page load times by lazily loading references only when the user clicks on a reference link.
Specifically, each reference link inside a superscript tag has a unique identifier, which can be used to retrieve the HTML for that reference and render it in a reference drawer:

[Screenshot: Screen Shot 2016-03-22 at 1.58.59 PM.png]

TODO

TODO: @Halfak @kaldari: @Cenarium Add your use cases

Requirements

  • For each revision, support efficient retrieval of
    • page content without references, and
    • all reference blocks.
  • For each revision, support retrieval of individual references.
  • Ideally, provide stable identifiers of unmodified references across revisions.
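
As a rough illustration of the first two requirements, the sketch below splits a revision's wikitext into body text with placeholders and a separate map of reference blocks. It is a toy under loud assumptions: the regular expression handles only simple `<ref>...</ref>` pairs (no self-closing named refs, no nesting, no templates), and `split_references`, the `@@ref-N@@` placeholder format, and the sample text are hypothetical, not part of Cite or any existing API.

```python
import re

# Matches simple <ref ...>...</ref> pairs only. Self-closing <ref name="x" />
# tags and nested markup are deliberately out of scope for this sketch.
REF_PATTERN = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL)


def split_references(wikitext):
    """Return (body_with_placeholders, {ref_key: reference_wikitext})."""
    references = {}

    def replace(match):
        key = f"ref-{len(references) + 1}"  # hypothetical placeholder key
        references[key] = match.group(1).strip()
        return f"@@{key}@@"

    body = REF_PATTERN.sub(replace, wikitext)
    return body, references


body, refs = split_references(
    'Water is wet.<ref>Smith 2010, p. 4.</ref> '
    'Fire is hot.<ref name="j">Jones 2012.</ref>'
)
```

With the two parts stored separately, a client could fetch `body` on page load and request entries from `refs` individually. The counter-based keys used here are exactly the kind of identifier that shifts between revisions.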

Possible solutions

Goal: Discuss different options, focusing on trade-offs.

Stable reference IDs

Current reference ids are based on counters, which makes them inherently unstable between revisions or parser implementations (PHP parser vs. Parsoid). It would be useful to instead use stable reference ids, based on something like a hash of the reference wikitext.
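
A minimal sketch of the idea, assuming SHA-1 over whitespace-normalized reference wikitext and a truncated hex digest. The function name `stable_ref_id`, the `ref-` prefix, and the truncation length are illustrative choices, not an agreed design.

```python
import hashlib


def stable_ref_id(ref_wikitext, length=12):
    """Derive an id from the reference's own wikitext (whitespace-normalized)."""
    normalized = " ".join(ref_wikitext.split())
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return "ref-" + digest[:length]


old_revision = ["Smith 2010, p. 4.", "Jones 2012."]
# In the next revision, a new reference is inserted before the existing ones.
new_revision = ["Lee 2015."] + old_revision

counter_ids_old = {text: i + 1 for i, text in enumerate(old_revision)}
counter_ids_new = {text: i + 1 for i, text in enumerate(new_revision)}
hash_ids_old = {text: stable_ref_id(text) for text in old_revision}
hash_ids_new = {text: stable_ref_id(text) for text in new_revision}

smith = "Smith 2010, p. 4."
# The counter-based id of the unmodified reference shifts from 1 to 2 ...
print(counter_ids_old[smith], counter_ids_new[smith])
# ... while the hash-based id is identical in both revisions.
print(hash_ids_old[smith] == hash_ids_new[smith])
```

Two caveats apply to this sketch: references with identical wikitext receive the same id, and a hash of the wikitext does not change when a template transcluded inside the reference changes.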

Advantages:

  • Enables tracking of unmodified reference content across revisions.
  • Automatic cache busting for per-reference APIs.
  • Can simplify reuse of expanded reference content.
  • Makes use of Parsoid reference content from PHP-parser views robust.

Disadvantages:

  • Need to add an additional id attribute, increasing HTML size.
  • Cost of calculating hashes.

Potential avenues worth exploring

  • Ability to update a reference URL in one place (for example, if page A and page B both use reference C and the link in reference C moves to a different URL, it could be updated in one place)
  • Ability to edit references without editing the entire page, e.g. to fix dead links or use the Wayback Machine for rewrites

Questions:

  • Do we need to continue supporting current numbered reference links long term?

Event Timeline

Jdlrobson updated the task description.
Jdlrobson added a subscriber: Cenarium.
Krinkle renamed this task from WIP RFC: Reference API requirements and options to WIP [RFC] Reference API requirements and options. Mar 23 2016, 8:15 PM
Krinkle renamed this task from WIP [RFC] Reference API requirements and options to WIP RFC: Reference API requirements and options.

It feels to me like a big misstep to build lots of long-term infrastructure around the current inline, unstructured, disparate, non-annotated end-notes system for references. Ideally we'd put the effort into the actual replacement system (i.e., structured, centralised references that can be applied as annotations to statements and shared between contexts, with a number of display models suited to the context – page-end-notes, hoverable, come-with-on-paste, etc.).

Though it's been talked about on and off for a few years, I can't seem to find anything in Phabricator about this (though T90852: Create a system to store and query links to books would cover a small part of it).

Expanding upon my reply at T125329#2146478:
A revision-based system is inadequate for MobileFrontend: a page's references can change through edits to pages transcluded into it, without a new revision of the page itself. We need a timestamp-based system (timestamp at the time of parse). References defined within included pages or templates are actually frequent: more than 20,000 templates contain references, not to mention the countless project pages that rely on transcluded pages, where use of references is common.

This should also support FlaggedRevs. Depending on config, FlaggedRevs can use stable revisions of the included pages, so a revision-based system is insufficient: we need to know whether the stable or draft version is requested. The only way to get the references for the stable revision is by asking the stable parser cache.

MobileFrontend just needs the references for the latest version or the stable version; we don't need to optimize other revisions. Providing support for old revisions is feature creep. There's a reason why there's no parser cache for them. It would also be inadequate for the stated objective of protecting against modifications of references since, as I've mentioned above, references are often defined on included pages or templates. Otherwise we would have to store data every time a page gets "touched", and twice that with FlaggedRevs enabled, which seems out of the question. The only alternative is to re-parse the whole revision, which IMO would be too much hassle for the expected benefits.
MobileFrontend should just not lazy load references when accessing old revisions, and when a stable/latest revision is accessed and the requested reference was changed in the meantime, it should ask to reload. Problem solved.
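
The reload behaviour proposed here could be sketched as a timestamp comparison. Everything named below (`ReferenceRecord`, `fetch_reference`, the response shape, the sample ids and timestamps) is hypothetical and only illustrates the idea: the client sends the parse timestamp of the page it is showing, and the server asks for a reload if the reference changed after that.

```python
from dataclasses import dataclass


@dataclass
class ReferenceRecord:
    html: str
    last_modified: int  # Unix timestamp of the last change to this reference


def fetch_reference(store, ref_id, page_parsed_at):
    """Return the reference HTML, or a reload hint if the page view is stale."""
    record = store.get(ref_id)
    if record is None or record.last_modified > page_parsed_at:
        # The reference was removed or changed after the page was rendered.
        return {"status": "reload"}
    return {"status": "ok", "html": record.html}


store = {"cite_note-1": ReferenceRecord("<cite>Smith 2010</cite>", last_modified=1000)}

# Page rendered after the last change: the reference can be served as is.
fresh = fetch_reference(store, "cite_note-1", page_parsed_at=1500)
# Page rendered before the last change: the client is asked to reload.
stale = fetch_reference(store, "cite_note-1", page_parsed_at=900)
```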

For references in section previews, we need the half-parsed wikitext as is (not fully parsed), which then gets reinjected into the preview parse. We cannot rely on Parsoid or other fancy stuff: this needs to work for vanilla MediaWiki core + Cite (same for MobileFrontend).

As for the stable reference IDs, we don't need this to know when a reference was last modified. And templates included within references can change too.

Potential avenues worth exploring

  • Ability to update a reference URL in one place (for example, if page A and page B both use reference C and the link in reference C moves to a different URL, it could be updated in one place)

Seems prone to bugs, don't think there's been any request for such a feature and use would be IMO too limited. Refs shared among pages are a good idea but would have to be properly implemented in Cite. But it's off-topic, we should provide support for the current implementation.

  • Ability to edit references without editing the entire page, e.g. to fix dead links or use the Wayback Machine for rewrites

Why would it be bad to edit the entire page in that case?

Questions:

  • Do we need to continue supporting current numbered reference links long term?

It worked fine until now, I don't see why this should change.

So let me add a couple requirements:

  • The solution should work for standard MediaWiki + Cite, which means no fancy stuff like Parsoid.

(There are plenty of wikis with Cite or MobileFrontend but without Parsoid that would hugely benefit from those features.)

  • We need the half-parsed wikitext as part of the data.

(For section preview)

  • We need the last modified timestamp.

(For MobileFrontend, cf issue with included pages / templates.)

So the only viable solution is to retrieve the references from the stable or latest parser cache. We can maintain a "last modified" timestamp by comparing the new references data to the previously cached data. The references data and last-modified timestamp also work for references defined within included pages (since the half-parsed text is already expanded). This is exactly what is done in https://gerrit.wikimedia.org/r/#/c/278703/. I've tested this extensively and it works fine. References aren't parsed initially, but when a request for a parsed reference is made, they get parsed and then saved to cache, so they don't need to be re-parsed (unless specifically modified at a later date).
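
The approach described above might look roughly like the following. This is not the code from the Gerrit change; it is an in-memory toy in which `ReferenceCache`, its method names, and the `parse` callback are all assumptions made for illustration.

```python
class ReferenceCache:
    """Toy stand-in for reference data kept alongside the parser cache."""

    def __init__(self):
        self.entries = {}  # ref_id -> {"wikitext", "last_modified", "html"}

    def update(self, references, now):
        """Store references from a fresh page parse. Entries whose half-parsed
        text is unchanged keep their old timestamp and any cached HTML."""
        new_entries = {}
        for ref_id, wikitext in references.items():
            old = self.entries.get(ref_id)
            if old is not None and old["wikitext"] == wikitext:
                new_entries[ref_id] = old
            else:
                new_entries[ref_id] = {
                    "wikitext": wikitext,
                    "last_modified": now,
                    "html": None,  # parsed lazily on first request
                }
        self.entries = new_entries

    def get_html(self, ref_id, parse):
        """Parse a reference on first request, then serve the cached HTML."""
        entry = self.entries[ref_id]
        if entry["html"] is None:
            entry["html"] = parse(entry["wikitext"])
        return entry["html"]


cache = ReferenceCache()
cache.update({"1": "Smith 2010", "2": "Jones 2012"}, now=100)
# A later parse in which only reference "2" has changed.
cache.update({"1": "Smith 2010", "2": "Jones 2013"}, now=200)
```

After the second `update`, reference "1" keeps timestamp 100 while reference "2" moves to 200, so only the changed reference looks modified to clients.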

Krinkle triaged this task as Medium priority. Mar 24 2016, 5:03 AM

@DarTar is organizing an event ( T125186 ) w/ @Harej & @Daniel_Mietchen to map the long term support of structured citation data within the movement -- and doing it within the existing citation infrastructure. The priority at the moment is understanding what that structure and client relationship with Wikidata will look like -- and how to do that storage in a way that supports most of the case studies currently available.

The test environment for the structured citations is at T120115

It feels to me like a big misstep to build lots of long-term infrastructure around the current inline, unstructured, disparate, non-annotated end-notes system for references. Ideally we'd put the effort into the actual replacement system (i.e., structured, centralised references that can be applied as annotations to statements and shared between contexts, with a number of display models suited to the context – page-end-notes, hoverable, come-with-on-paste, etc.).

I support this comment, fwiw, especially given @Sadads's comment about ongoing work on structured citations.

Thanks @Sadads for the ping. @GWicke there is indeed quite a lot of work in the pipeline to support citations as structured data. For some context, see this talk I gave in December.

We'll be sending out an announcement for WikiCite shortly (Monday morning at the latest). The event is probably going to be the best place to meet other people and community members involved in this effort, discuss technical requirements for accessing citations from a structured database and potentially building dedicated API endpoints. (cc @dr0ptp4kt - I talked to @Jdlrobson about this today).

Thanks for the heads-up, @Sadads!

The original scope of this RFC is limited to providing more structured access and component interfaces for existing citation content. As such, it could perhaps provide a stepping-stone towards more ambitious citation systems mentioned above. It would likely help in the task of migrating existing citations into structured data by providing easy access to template parameters and other metadata, and supporting editing via structured HTML.

In any case, I think we should wait until we have more clarity about how this fits with your longer-term plans.

VisualEditor in mobile should then probably match the behavior of page views and not output the references list, especially considering that they are static blocks of limited usefulness.
Not sure if the transmission of data can be avoided though.

@Cenarium I would like to push back a bit on "they are static blocks of limited usefulness". Citations are at the core of how English Wikipedia and many other Wikipedias prove authority, credibility, and reliability to their readers. Much like the critique in the late 2000s of Wikipedia's eschewing of authority, Wikipedia is now coming under a lot of critique in regions like India, Latin America, and Africa, because it's new and different and challenges different structures of expert information. Making the footnotes less visible makes it easier for either a) new readers or b) critics to challenge whether or not readers should use it. Moreover, footnotes are at the core of the research/educational/university value of Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Research_help/Proposal

Building software features that deliberately hide citations or make them hard to find is a mistake for social reasons, especially since our biggest growth market, mobile, is in the places where Wikipedia's authority is being challenged.

I was talking about VisualEditor here, not regular page views. Do you mean that you are opposed to lazy loading of references on mobile? (The objective being to make WP more accessible to slow connections - T123328.)
@Jdlrobson Maybe there should be a way to fully load a references list with JS (e.g. by clicking on a 'load references' button)?

By "static block of limited usefulness", I mean that when a reference is changed in VisualEditor, the reference as it appears in the reflist does not reflect the modification, so there is not much point in outputting the block of references in VisualEditor.

RobLa-WMF mentioned this in Unknown Object (Event). Apr 13 2016, 7:34 PM

The event (Wikicite) is scheduled for May 25-26, 2016 in Berlin: https://meta.wikimedia.org/wiki/WikiCite_2016

RobLa-WMF mentioned this in Unknown Object (Event). May 4 2016, 7:33 PM

By "static block of limited usefulness", I mean that when a reference is changed in VisualEditor, the reference as it appears in the reflist does not reflect the modification, so there is not much point in outputting the block of references in VisualEditor.

This is not true.

You're probably thinking of the problem with people using the hack that is {{reflist}} which they're explicitly not meant to use, which indeed can't be updated (amongst other issues).

GWicke edited projects, added Services (designing); removed Services.

@DarTar, is there anything in this task that is still relevant to your citation work? I would be inclined to close it otherwise.

GWicke lowered the priority of this task from Medium to Low. Jul 11 2017, 11:12 PM

This RFC seems to be stalled. If there is currently no interest in driving this further, it should for now be removed from the RFC work board.
If there is interest in continuing the RFC process, please let us (TechCom) know who will be working on this RFC, who commits to implementing it if approved, and in what time frame.

Removing from the RFC workboard as this still appears to be stalled.

(Per RFC policy, adding to Backlog instead. Reason: Resourcing is currently missing, and problem statement still WIP.)

Pchelolo subscribed.

Reading Infrastructure has created a references endpoint for use by mobile.