
Save references in page_props and cache
Closed, Declined · Public · 3 Estimated Story Points

Description

Saving references in page_props and a cache would allow retrieving them efficiently through the API (T123290) and for section preview (T124840).


Event Timeline


@Cenarium is making good progress with the patch but there's an improvement that can be made around how the references are accumulated during parsing. For visibility, the example I've provided is here: https://gist.github.com/phuedx/e7f48a2132010a4eef39.

@Cenarium: Just FYI, Web-Team-Backlog is tracking the progress of your patch(es) – because we're both interested and excited – so you'll likely see this ticket move around the Reading-Web-Sprint-65-Game of Phones board. If it's too noisy for you, then we'll figure something else out.

It's okay. It's Game of Phones, so I'm cool with it.

@Cenarium is making good progress with the patch but there's an improvement that can be made around how the references are accumulated during parsing. For visibility, the example I've provided is here: https://gist.github.com/phuedx/e7f48a2132010a4eef39.

As noted in gerrit, merging would result in issues when a group appears several times on a page. I've fixed the duplication issue in the latest patch set.

This is a lot of data to store in page_props. Are there any concerns about bloating that table? Might be worth pinging @jcrespo and @aaron.

I am blocking any non-regular addition of data (even if minimal) to s2 wikis until we promote the new master (hopefully on Tuesday), and very large additions until the new servers arrive (March). Aside from that, please give a ball-park figure per wiki/article, etc. (how many extra rows/bytes).

Very tall and normalized tables may be worth compressing into LOBs, but that depends exclusively on how they are accessed: what indexes are necessary, etc.

Estimating the data size might be difficult for this. You could get a rough idea of the number of rows by querying for the number of unique pages in the el_from field of the externallinks table. That's only going to give you a rough approximation though, since many references don't include external links, and many external links aren't in references.
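A minimal sketch of that row-count estimate, assuming it is run from a maintenance context (e.g. eval.php); as noted above, the count is only a rough proxy:

```php
// Rough proxy for "pages that would get a references row": count the
// distinct pages that have at least one external link.
$dbr = wfGetDB( DB_SLAVE );
$pagesWithExternalLinks = (int)$dbr->selectField(
	'externallinks',
	'COUNT(DISTINCT el_from)',
	[],
	__METHOD__
);
echo "Pages with at least one external link: $pagesWithExternalLinks\n";
```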

Estimating the bytes would be even harder. Articles like Barack Obama have a huge amount of reference data to store (probably spanning multiple blobs), while other articles will only need a few dozen bytes. Anyone have an idea for how to estimate that?

I see you are LOBbing already, my apologies for not having looked first.

"querying for the number of unique pages in the el_from field of the externallinks table" could be enough, but may I suggest you to deploy to beta/custom test install, create Barack Obama and a couple of other representative pages and report the results, both for parsing and caching?

The second question is: what are the plans to fill in this information? Will you run a job on all articles, or will you let it be created naturally?

I am not the performance guy here, but what is the impact on saving?

Estimating the data size might be difficult for this. You could get a rough idea of the number of rows by querying for the number of unique pages in the el_from field of the externallinks table. That's only going to give you a rough approximation though, since many references don't include external links, and many external links aren't in references.

A search for "insource:<ref>" on enwiki returns 3,628,920 results.

Estimating the bytes would be even harder. Articles like Barack Obama have a huge amount of reference data to store (probably spanning multiple blobs), while other articles will only need a few dozen bytes. Anyone have an idea for how to estimate that?

Special:PagesWithProp gives the size of a given property for each page (there's also an API). I had imported several enwiki articles for testing. The compressed refs JSON for Obama is 50 KB; it doesn't even fill a single blob. The article with the biggest compressed refs JSON size I've found is List of Dutch inventions and discoveries, at 77 KB. But those are the most extreme cases; I don't expect more than a few dozen pages to take more than one blob, and virtually none more than two.

It shouldn't be too difficult to estimate an average size, probably no more than a couple of KB.

The second question is: what are the plans to fill in this information? Will you run a job on all articles, or will you let it be created naturally?

I am not the performance guy here, but what is the impact on saving?

It is filled during LinksUpdate, after parsing (if references have changed); it's a deferred update, so it has no impact on saving. It's also cached on parse and on DB read.
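For illustration, a minimal sketch of how such a deferred write works, assuming the compressed references JSON is attached to the ParserOutput as a page property (the property name is the one mentioned later in this thread; $refsData is a placeholder):

```php
// During the parse, Cite attaches the compressed references JSON to the
// ParserOutput as a page property.
$parserOutput->setProperty(
	'references-1',
	gzencode( json_encode( $refsData ) )
);

// After the edit is saved, LinksUpdate runs as a deferred update and writes
// every ParserOutput property into the page_props table, so nothing extra
// happens inside the save transaction itself.
```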

A search for "insource:<ref>" on enwiki returns 3,628,920 results.

This doesn't include non-article pages, which would bring it to around 4M. We can also check the transclusion count of Template:Reflist, which is 3,750,000+.

Will you run a job on all articles, or will you let it be created naturally?

We might have to, since this change isn't going to be picked up by the regular RefreshLinksJob.

4 million rows in a 26-million-row table does not worry me too much. The average row length now, however, is 79 bytes. I do not know if the new rows will be larger; if they are, you should consider adding them to a separate table.

I am more worried about the RefreshLinksJob, and let me add @BBlack, as this is another potential "purge all pages cache" process, which we are currently battling and which has a large impact on the whole infrastructure, at both the database and Varnish level.

I will definitely test it on a small wiki first.

4 million rows in a 26-million-row table does not worry me too much. The average row length now, however, is 79 bytes. I do not know if the new rows will be larger; if they are, you should consider adding them to a separate table.

It would be at least an order of magnitude larger, at most a couple of KB. So you think we need to create a new "big_table_props" table?

I am more worried about the RefreshLinksJob, and let me add @BBlack, as this is another potential "purge all pages cache" process, which we are currently battling and which has a large impact on the whole infrastructure, at both the database and Varnish level.

I guess we could do without a job and let the DB populate naturally (MobileFrontend/section preview would have to roll back to the previous behavior if nothing is found in the DB). That said, it would only have to occur once after deploy, and we should be able to make it run only for pages with references (4M). So it would be about the same level of disruption as an admin editing Template:Reflist or another heavily transcluded template.

I am not the performance guy here, but what is the impact on saving?

The references HTML on our larger pages currently accounts for 50% of the HTML we send to our users. Loading it on demand would cut Barack Obama's HTML size in half, from 185 KiB to 92.5 KiB.

This would in turn reduce load time and potentially first paint, specifically for our 2G users, meaning they can access our content where previously this was not possible or was too slow. In tests, albeit on Parsoid content, in the extreme cases we saw a 27% decrease in full load time and a 55% improvement in first paint on a 2G connection... so fingers crossed these results replicate themselves in production.

@jcrespo Is there anything blocking us from feature flagging this in the Cite extension and enabling it on the beta cluster to get you more accurate answers about performance impact, and then later enabling it on a small wiki to give you a better sense of the storage size?

In terms of clearing pages, I don't think this should be necessary. We would plan to roll this out slowly, so we could easily allow this to happen over time and only launch the performance improvement once the parser has populated the data...

@jcrespo Is there anything blocking us from feature flagging this in the Cite extension and enabling it on the beta cluster to get you more accurate answers about performance impact, and then later enabling it on a small wiki to give you a better sense of the storage size?

Not only is it not blocked, I am actually encouraging all of you to test it thoroughly before hitting production, followed by a gradual, eventual rollout to production.

My only worry is potentially affecting the performance of queries that join the page_props table. If we do decide to use that table, rather than creating a new one, I think it would be a good idea to gather some baseline stats on such queries so that we can actually see before-and-after performance. For example, the Special:Random query joins the page_props table. If its execution time grows substantially due to this feature, we'll know that we should move the data into a separate table instead.

I've reviewed @Cenarium's patch and it looks good. There's a query inline about possibly avoiding an additional page_props query if the JSON blob is corrupt (for whatever reason).

@kaldari could you list some URLs for API requests you are worried about?
We could set up a script that hits the API before and after the change and measures the time taken.

My suggestion would be:

  • We merge @Cenarium 's patch
  • We collect X days of data around time needed for the list of API requests you give me on pages which have references
  • Enable feature flag on beta cluster to store references
  • We collect X days of data around time needed for API requests on pages which have references
  • Compare results.
  • If results show little to no degradation, we then roll out to a smaller wiki with ops' blessing/permission to get a sense of any unexpected issues. (any particular one you'd recommend?)

Thoughts?

With regard to the performance impacts of lazy loading references on the mobile web, I have a card in the next sprint which can be done without any of this (T126390).
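A hypothetical sketch of the kind of timing script mentioned above (plain PHP; the endpoint, page title and request count are placeholders):

```php
<?php
// Measure API response times for a representative request, before and after
// enabling the feature flag on the beta cluster.
$url = 'https://en.wikipedia.beta.wmflabs.org/w/api.php'
	. '?action=query&format=json&prop=pageprops&titles=Barack_Obama';

$times = [];
for ( $i = 0; $i < 20; $i++ ) {
	$start = microtime( true );
	file_get_contents( $url );
	$times[] = microtime( true ) - $start;
}
sort( $times );
printf(
	"median: %.3fs  mean: %.3fs\n",
	$times[ (int)( count( $times ) / 2 ) ],
	array_sum( $times ) / count( $times )
);
```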

@kaldari Let me shed some light on the infrastructure side of things. I cannot, for the most part, say "this is ok" or "this is not ok" without some stats.

  • For larger wikis, a separate table may be preferable: easier to manage, one will not affect the other, and there *can* be a difference in performance as it will probably have a lot of records
  • For smaller wikis, a separate table makes no sense; in fact, reducing the number of tables may even be an advantage, as we handle 40K tables on a single host, and that starts to be problematic. If all you do is access records by primary key, there will not be a huge difference in any case.

Where the limit between the two lies, or whether one of the two makes no sense at all, we do not know and will not discover until it is tested with *load* and *sizes* similar to production (or at least a bit larger than a development machine). Things like Labs virtual machines are precisely intended for this kind of testing, and many people host their MediaWikis there for testing.

Change 267514 merged by jenkins-bot:
Store references in page_props and cache

https://gerrit.wikimedia.org/r/267514

I'll leave this open for now as it's tracking @Cenarium's work, not ours.

I'd consider this closed only if this goes into production (with this table or another one).

I was thinking about this over the weekend, and I'm now re-thinking the whole page props approach.
The references-1 page prop is surfaced in the API but is not very useful without the references-2 prop... and even with it, it's arguable. Was it the right decision to store it in page props?
e.g. http://localhost:8888/w/index.php/Special:ApiSandbox?useformat=desktop#action=query&format=json&prop=pageprops&titles=Selenium_References_test_page&ppprop=references-1

The advantage for page props seems to be API access but that doesn't seem to help here.
Anyone have any thoughts around this?

Page_props allows us to store the data somewhere. But this still needs a proper API (T123290); the page_props API is indeed not useful here. The API should simply call Cite::getStoredReferences( $title ).

This approach is the one used by the Graph extension to store Graph data, and by the TemplateData extension to store TemplateData.
They encode and compress the data to store it in page_props, then provide an API that decompresses and decodes the data.
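A sketch of that encode/compress round trip (the helper names are invented for illustration; the actual Graph, TemplateData and Cite code may differ in the details):

```php
// Store: serialize the references structure and compress it so it fits in
// as few page_props rows as possible.
function encodeRefsForPageProps( array $refs ) {
	return gzencode( json_encode( $refs ) );
}

// Retrieve: the API module reads the pp_value blob(s), decompresses and
// decodes them, and returns the structured data to the client.
function decodeRefsFromPageProps( $blob ) {
	return json_decode( gzdecode( $blob ), true );
}
```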

Page_props allows us to store the data somewhere. But this still needs a proper API (T123290); the page_props API is indeed not useful here. The API should simply call Cite::getStoredReferences( $title ).

This approach is the one used by the Graph extension to store Graph data, and by the TemplateData extension to store TemplateData.
They encode and compress the data to store it in page_props, then provide an API that decompresses and decodes the data.

I set up T127225 :)

I'm adding Citoid here just in case this might interest you.

Estimating the data size might be difficult for this. You could get a rough idea of the number of rows by querying for the number of unique pages in the el_from field of the externallinks table. That's only going to give you a rough approximation though, since many references don't include external links, and many external links aren't in references.

Estimating the bytes would be even harder. Articles like Barack Obama have a huge amount of reference data to store (probably spanning multiple blobs), while other articles will only need a few dozen bytes. Anyone have an idea for how to estimate that?

It is fairly easy to do this in Parsoid: we currently emit various metrics during round-trip testing, and we can dump one more metric (size:references). You could then munge the raw data from 160K pages across ~30 wikis. Let me know if anyone wants this done.

I've raised concerns about this in the past in Reading Web meetings and elsewhere. Just want to make sure this goes through proper measures and considerations before we roll this out to any production wikis.

Design as it was proposed above:

  • Make a copy of the generated Parser HTML for <references> in the Cite extension and store this in the page_props table when saving a new revision for the page.
  • page_props is not versioned by revision. It's only associated by page. This means it only stores data associated with the current version of a page.
  • It will easily have up to 20 KB or 50 KB of data to store. It's essentially the HTML as served to users for the references section. The savings pointed out are irrelevant from a database perspective. It'll still be stored in the parser cache for other consumers (which at this point is everything except opted-in mobile views).
  • Due to the uncontrollable size, it may need to store multiple rows in page_props for a single page.
  • The number of rows added to page_props is in theory as many as there are pages with references, possibly multiplied several times, since a page can need any number of rows depending on how much output it has.

Design as I see it implemented in the Cite extension currently:

Some issues:

  • No eviction policy. Parser cache data stored elsewhere expires after 30 days but this doesn't. This means the computed result may go stale as things change in the underlying implementation of Cite. The version number will help regeneration in time (if we remember to update it), but not eviction. The feature only needs a volatile cache, but is instead implemented on top of MySQL (permanent storage with replication and slaves) - with each entry stored indefinitely. We just finished a painful migration in ResourceLoader where it was using the database as a cache. We should not be writing new features that make the same mistake.
  • No consistent identifier. The current storage model, being page properties, is limited to associating data with pages, whereas this data should be associated with revisions. As a result there will be two problems:
    • When viewing the latest version of pages, MobileFrontend will likely often fail to fetch the appropriate references because the page has changed in the meantime. As such, references may have been renumbered, point to different citations entirely, or no longer exist.
    • It will be unable to render specific or old revisions of a page (from user contributions page, recent changes, or page history).

A simpler alternative previously proposed by me:

  • The original plan describes that we'll strip out references on the fly when serving revision content and fetch them lazily via the API from page_props. Instead of fetching from page_props, just fetch them from the revision content cache again, but this time strip out everything except the references.

As I've said on multiple occasions, the 64KB maximum size of pp_value is not an accident, it reflects the cost of putting data there. It needs to be written in LinksUpdate on every edit, during a very performance-sensitive transaction. If you need anything like 64KB, you should find some other data store.

I've raised concerns about this in the past in Reading Web meetings and elsewhere. Just want to make sure this goes through proper measures and considerations before we roll this out to any production wikis.

Design as it was proposed above:

  • Make a copy of the generated Parser HTML for <references> in the Cite extension and store this in the page_props table when saving a new revision for the page.

It's not the HTML but the (slightly parsed) wikitext. Also, it's done whenever the page is parsed, not just when saving a new revision of the page.

  • page_props is not versioned by revision. It's only associated by page. This means it only stores data associated with the current version of a page.
  • It will easily have up to 20 KB or 50 KB of data to store. It's essentially the HTML as served to users for the references section. The savings pointed out are irrelevant from a database perspective. It'll still be stored in the parser cache for other consumers (which at this point is everything except opted-in mobile views).
  • Due to the uncontrollable size, it may need to store multiple rows in page_props for a single page.
  • The number of rows added to page_props is in theory as many as there are pages with references, possibly multiplied several times, since a page can need any number of rows depending on how much output it has.

From testing, no enwiki article has more than 2 rows. It's theoretically possible to make pages with more rows, so if there is a concern that this could be abused we can enforce a limit.

Design as I see it implemented in the Cite extension currently:

This one is unrelated; it's about another, much older and different cache, not used in production AFAIK.

Some issues:

  • No eviction policy. Parser cache data stored elsewhere expires after 30 days but this doesn't. This means the computed result may go stale as things change in the underlying implementation of Cite. The version number will help regeneration in time (if we remember to update it), but not eviction. The feature only needs a volatile cache, but is instead implemented on top of MySQL (permanent storage with replication and slaves) - with each entry stored indefinitely. We just finished a painful migration in ResourceLoader where it was using the database as a cache. We should not be writing new features that make the same mistake.

This feature will not only be used by MobileFrontend but also by section previews (T124840), references bots through the API and probably others. Persistent storage is not needed by MF or section preview, so using a 30-day ($wgParserCacheExpireTime) cache would be an option there if we can't use the DB, but then bots and other API clients wouldn't be able to retrieve refs from pages that haven't been parsed for more than $wgParserCacheExpireTime (this isn't a major concern for WMF wikis, where it's 30 days).
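As a rough illustration of the 30-day cache option: a sketch only, assuming a WAN object cache keyed per page; the key name and the fallback helper are invented:

```php
// Cache the references data with the same TTL as the parser cache.
global $wgParserCacheExpireTime;

$cache = ObjectCache::getMainWANInstance();
$key = $cache->makeKey( 'cite-references', $title->getArticleID() );

$refs = $cache->getWithSetCallback(
	$key,
	$wgParserCacheExpireTime, // 30 days on WMF wikis
	function () use ( $title ) {
		// On a cache miss, fall back to re-parsing the page and extracting
		// the references from the ParserOutput (hypothetical helper).
		return buildReferencesByParsing( $title );
	}
);
```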

  • No consistent identifier. The current storage model, being page properties, is limited to associating data with pages, whereas this data should be associated with revisions. As a result there will be two problems:
    • When viewing the latest version of pages, MobileFrontend will likely often fail to fetch the appropriate references because the page has changed in the meantime. As such, references may have been renumbered, point to different citations entirely, or no longer exist.

This shouldn't happen, since whenever the page is re-parsed, the cache and DB are updated. Even assuming slave lag, since the cache has been updated, it will retrieve the latest data.

  • It will be unable to render specific or old revisions of a page (from user contributions page, recent changes, or page history).

In those cases, it could render as it previously did. Unfortunately, we would miss out on the optimizations if FlaggedRevs is used (old stable revs); we would have to fall back to the [edit: FlaggedRevs stable] parser cache.

A simpler alternative previously proposed by me:

  • The original plan describes that we'll strip out references on the fly when serving revision content and fetch them lazily via the API from page_props. Instead of fetching from page_props, just fetch them from the revision content cache again, but this time strip out everything except the references.

This may work for MobileFrontend, but not for the other use cases (section preview and bots need wikitext). For these other use cases, we need to store this additional data in the DB or in the parser cache.
Though this means the data wouldn't be stored separately from the rest of the content, which I thought from T125134 was desirable? (I should add that separating out references is possible during parse, but only in (slightly parsed) wikitext form, as was done, and not in HTML.)
Just a small note: we should consider sending the client only the HTML for the requested reference, not for all of them, since connection speed appears to be the limiting factor.

As I've said on multiple occasions, the 64KB maximum size of pp_value is not an accident, it reflects the cost of putting data there. It needs to be written in LinksUpdate on every edit, during a very performance-sensitive transaction. If you need anything like 64KB, you should find some other data store.

Probably only the parser cache could fit then.

Revision-independent storage seems conceptually wrong for this:

  1. user opens article on mobile
  2. someone edits the references
  3. user scrolls down, references get lazy-loaded

-> user gets an inconsistent view of the article.

Design as it was proposed:

  • Make a copy of the generated Parser HTML for <references> [..]

It's not the HTML but the (slightly parsed) wikitext.

See also T127263. If we don't store the HTML, then I worry about the performance of the API request. We'd have to re-parse the same wikitext again in the lazy-load request. We can't cache the request because it doesn't have a revision ID. The API request itself would cache its parser result somewhere (with a key containing a hash of the wikitext and the parser cache key: page title, user language, etc.). Each request would: fetch the raw wikitext from the page_props database table (presumably using a master connection to avoid slave lag, on a GET request, which triggers a DB warning), then hash it and fetch from a custom parser cache, and, if absent (first view after edit), invoke the Parser.

  • page_props is not versioned by revision. It's only associated by page. This means it only stores data associated with the current version of a page.

[..]

  • No consistent identifier. The current storage model, being page properties, is limited to associating data with pages, whereas this data should be associated with revisions. As a result there will be two problems:
    • When viewing the latest version of pages, MobileFrontend will likely often fail to fetch the appropriate references because the page has changed in the meantime. As such, references may have been renumbered, point to different citations entirely, or no longer exist.

This shouldn't happen, since whenever the page is re-parsed, the cache and DB are updated. Even assuming slave lag, since the cache has been updated, it will retrieve the latest data.

@Tgr explains the same. One can't assume people atomically view the latest version. There is database slave lag, Squid/Varnish purge lag and rebound purges, the browser back-forward cache, and more. Even in a world without caching or slave lag, by the time the HTML response for a page view is on its way back to the user, the contents may have changed numerous times. And in the case of lazy loading there is an actual, intentional delay between the view and the lazy-load request. This request and the underlying storage must carry the revision ID.

A simpler alternative previously proposed by me: [..]

This feature will not only be used by MobileFrontend but also by section previews (T124840), references bots through the API and probably others. [..]
This may work for MobileFrontend, but not for the other use cases (section preview and bots need wikitext). [..]

Section previews with references is an interesting use case. It'll require merging references from other sections with the currently edited section. This gets difficult due to the absence of persistent section identifiers. Even if we include a section number with each ref item, sections mutate all the time. The merge gets harder without a base revision.

"References bots" is new to me, but reliable editing also requires various meta data for authentication and edit conflict resolution and such.

A simpler alternative previously proposed by me:

  • The original plan describes that we'll strip out references on the fly when serving revision content and fetch them lazily via the API from page_props. Instead of fetching from page_props, just fetch them from the revision content cache again, but this time strip out everything except the references.

Hm, what do you mean by revision content cache? The parser cache contains the parser output only for the latest revision of the page (as of the latest page_touched, to handle new revisions of included pages/files). The only way to obtain the HTML of an old revision, including its references, is to parse it again (or to have saved it elsewhere previously), except for FlaggedRevs stable revisions, which have their own parser cache (FlaggedRevs/backend/FRParserCacheStable.php). (It's not possible to retrieve references HTML from wikitext without a parse of the full revision.)

Revision-independent storage seems conceptually wrong for this:

  1. user opens article on mobile
  2. someone edits the references
  3. user scrolls down, references get lazy-loaded

-> user gets an inconsistent view of the article.

Per above, there doesn't appear to be any way to store this data by revision ID. And this can also happen when an included template/page is modified (page_touched), but storing data for every such event doesn't seem feasible.
Caching the HTML of references extracted during the original mobile view for a short period of time, keyed by revision ID, may be an option, but this would still considerably increase cache size. The page_props result is cached though, and is much smaller in size (since it's wikitext), so we could add a revision ID to its cache key and use it. If this cache has expired, then we have no choice but to re-parse the revision and get the references data from it.

See also T127263. If we don't store the HTML, then I worry about the performance of the API request. We'd have to re-parse the same wikitext again in the lazy-load request. We can't cache the request because it doesn't have a revision ID. The API request itself would cache its parser result somewhere (with a key containing a hash of the wikitext and the parser cache key: page title, user language, etc.). Each request would: fetch the raw wikitext from the page_props database table (presumably using a master connection to avoid slave lag, on a GET request, which triggers a DB warning), then hash it and fetch from a custom parser cache, and, if absent (first view after edit), invoke the Parser.

A single reference is a small enough amount of wikitext that it gets parsed very quickly. This API call doesn't use the parser cache since it's text-based (not page-based).
The request doesn't have to fetch page_props from master since it first tries a cache that gets filled when the page is parsed, precisely to handle any slave lag.

@Tgr explains the same. One can't assume people atomically view the latest version. There is database slave lag, Squid/Varnish purge lag and rebound purges, the browser back-forward cache, and more. Even in a world without caching or slave lag, by the time the HTML response for a page view is on its way back to the user, the contents may have changed numerous times. And in the case of lazy loading there is an actual, intentional delay between the view and the lazy-load request. This request and the underlying storage must carry the revision ID.

The parser cache can't fix any of that, since it only stores the latest revision as last parsed. Implementing revision-based caching of references' wikitext can solve most of these issues. The remaining ones are intrinsic to any implementation of lazy loading of references and may not be fixable; they are, however, small enough not to outweigh the advantages of lazy loading, IMO.

Section previews with references is an interesting use case. It'll require merging references from other sections with the currently edited section. This gets difficult due to the absence of persistent section identifiers. Even if we include a section number with each ref item, sections mutate all the time. The merge gets harder without a base revision.

References have a unique identifier so it's unneeded, except on pages with multiple <references/> tags with <ref> tags in between, a rare occurrence, which is thus not supported in https://gerrit.wikimedia.org/r/#/c/267515/.

"References bots" is new to me, but reliable editing also requires various meta data for authentication and edit conflict resolution and such.

See T127263#2044692.

So to sum up, IMO:

  1. Latest revision view still latest as of reference lookup (i.e. > 99% of cases): fetch from page_props/references cache, or main parser cache if we can't use the db (still not convinced of that without at least some testing)
  2. Latest revision view no longer latest as of reference lookup (<< 1% of cases): try the references cache, if expired go to 4.
  3. Stable revision not latest and still stable as of reference lookup (maybe up to 1% of cases in FlaggedRevs-heavy setups): fetch from FlaggedRevs stable parser cache
  4. Any other revision (<< 1% of cases): re-parse the revision being viewed, fetch references data from parser output

This can be done by modifying the Cite::getStoredReferences() function according to these points; a revision ID parameter can be added to it and to the API.
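A hedged sketch of that lookup order; only Cite::getStoredReferences() is from the actual patch, the wrapper and the other helpers are invented names:

```php
function getReferencesForRevision( Title $title, $revId = null ) {
	if ( $revId === null || $revId === $title->getLatestRevID() ) {
		// 1. Latest revision still latest: references cache, then page_props.
		return Cite::getStoredReferences( $title );
	}
	if ( isStableFlaggedRevision( $title, $revId ) ) {
		// 3. Stable revision: FlaggedRevs stable parser cache.
		return getReferencesFromStableParserCache( $title );
	}
	// 2./4. Anything else: try the references cache, otherwise re-parse the
	// requested revision and read the references from its ParserOutput.
	return getReferencesByReparsing( $title, $revId );
}
```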

A few thoughts in reaction to latest discussions:

  • We're trying to keep the solution non-MobileFrontend-specific for the bots and section preview use cases. I know @Halfak is interested in an API; it would be great if he could let us know his use cases.
  • Interaction with reference links in mobile view is low according to EventLogging: at a 0.1 sample rate we get about 70k clicks a day on reference links (not unique users). I'm thus not sure how big an issue the @Tgr example is in practice, but it would be straightforward to pass the revision in the API request and throw an error when the revision has changed. Also, bundling the JSON blob in mw.config would remove this issue entirely (note it is also much smaller than the HTML list version).
  • Note a page may have more than one references tag, which makes non-JS fallbacks harder to construct, since we then have to keep track of multiple references sections.
  • I'm exploring mobileview as an option in T129182, but I feel this is suboptimal and should only be a temporary solution. I'm actually encountering more issues with broken reference lookups with that approach than with the existing one.
  • Maybe we are entering RFC territory here? If so, what would the RFC be for?

Hi @Jdlrobson, right now, I'm just hoping to figure out what is actually being stored in the page_props table.

But to answer the general question about use cases: I've been working on nice ways to produce datasets of references and scholarly identifiers. Researchers and product teams (e.g. Editing) use these datasets to analyze citation patterns, e.g. which templates are used, which journals show up in DOIs, and which domains are linked to the most. It would be great if we could make references structured in nice ways that would let this work be done easily in Quarry or via the API. Right now, I have to run complex, CPU-intensive jobs against the XML dumps in order to extract this information.

I'm thus not sure how big an issue the @Tgr example is in practice, but it would be straightforward to pass the revision in the API request and throw an error when the revision has changed. Also, bundling the JSON blob in mw.config would remove this issue entirely (note it is also much smaller than the HTML list version).

Throwing an error is a very good idea and simplifies everything :)
We then just need to handle FlaggedRevs, where the stable revision, which is actually displayed to readers, might not be the latest version.
We could send revision IDs instead of page IDs in the API query. Here's the now much simplified API workflow:

  1. If the passed revision is currently the latest rev, retrieve ref data from the references page_props / its cache [or from the parser cache if we really can't use the DB]
  2. Else if FlaggedRevs is enabled and the passed revision is currently the stable rev, retrieve ref data from the FlaggedRevs stable parser cache (would need a new function in the Cite class)
  3. Else throw an error, suggesting to refresh the page

In addition, we should not lazy load references for revisions accessed from an &oldid=... URL, since there's no point optimizing these, and it would too often result in 2.C. (avoiding this not being worth it).

  • Maybe we are entering RFC territory here? If so, what would the RFC be for?

Not sure if we need an RFC yet; I see two issues:

  • need for revision-based storage

Since FlaggedRevs can be handled with its own parser cache, we can simply throw an error if the revision is outdated, and old revisions should IMO not be lazy loaded, so at this point I think we don't need revision-based storage.

  • need for database storage

This depends on the use cases. If we can't use page_props, we could use the parser cache. The shorter the parser cache expiry is, the more likely it is to cause issues (and isn't there a risk of the parser cache being unreliable on some wikis, or even disabled? There, cache misses would be fatal, or would have to be handled by an expensive re-parse of the entire revision...). It shouldn't be a problem on WMF wikis for MobileFrontend or section preview (since a recent parse is certain to have occurred), and probably not for @Halfak's use case either. Issues would come up when working on very low-visibility pages, those not parsed in over a month (on WMF wikis).

Hi @Jdlrobson, right now, I'm just hoping to figure out what is actually being stored in the page_props table.

This is what gets surfaced via the API: P2721
What gets stored in page_props is basically the same, with a few minor differences (not a single object, no reflist property).

I'm thus not sure how big an issue the @Tgr example is in practice, but it would be straightforward to pass the revision in the API request and throw an error when the revision has changed. Also, bundling the JSON blob in mw.config would remove this issue entirely (note it is also much smaller than the HTML list version).

Throwing an error is a very good idea and simplifies everything :)

My thought was that the references blob could keep track of when it last changed,
e.g. revisionId: X, lastChangeRevisionId: Y

To take an example: if the current revision is say 505, we are viewing revision 500, and references were last edited (something added/removed) in revision 400, then the viewed revision would be okay. If the revision we were viewing was say 399, then the references would be outdated and the client would be required to respond with an error.
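A purely illustrative check of that rule, using the field names from the comment above:

```php
// The stored blob is valid for a viewed revision if the references were
// last changed at or before that revision.
function referencesAreCurrentFor( $viewedRevId, array $referencesBlob ) {
	return $viewedRevId >= $referencesBlob['lastChangeRevisionId'];
}

// Example: the blob says lastChangeRevisionId = 400.
// Viewing revision 500 -> still valid; viewing revision 399 -> outdated,
// so the client should show an error (or a staleness warning) and refresh.
```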

As you probably know, RESTBase was initially created to address a very similar need: The revisioned storage of HTML and metadata. Apart from regular revisioning, there is also generic support for TTL-based expiry of per-revision re-renders.

We are happy to chat about possibilities. Just give us a ping.

Change 277445 had a related patch set uploaded (by Cenarium):
Retrieve stored references from parser cache

https://gerrit.wikimedia.org/r/277445

A fairly typical FlaggedRevs configuration (used e.g. on dewiki) is that anonymous visitors see the stable version but everyone else sees the latest version. No idea how that's handled internally, but the Cite API should probably follow suit.

https://gerrit.wikimedia.org/r/#/c/277445/ uses the parser cache as storage rather than page_props. When the stable rev is requested, it instead checks the FlaggedRevs stable parser cache.
MobileFrontend could always request the stable rev, unless the version viewed is a draft, which can be detected by checking for the existence of some HTML id.

My thought was that the references blob could keep track of when it last changed,
e.g. revisionId: X, lastChangeRevisionId: Y

To take an example: if the current revision is say 505, we are viewing revision 500, and references were last edited (something added/removed) in revision 400, then the viewed revision would be okay. If the revision we were viewing was say 399, then the references would be outdated and the client would be required to respond with an error.

I don't think it would be possible to efficiently keep track of when references were last changed. And the latest page revision might not even have been saved to the database yet when the references blob is created.
The easiest way to implement invalidation would be with $parserOutput->getCacheRevisionId(). We might still want to display the ref, though, but with a warning that it may be outdated, rather than displaying an error.

I've found a way to determine when a ref was last changed and to handle FlaggedRevs :)

This requires a new parser cache (and parser output) custom-built for Cite that gets generated every time a parse is done and is provided with the references data.
In addition, whenever a parsed reference is requested, the cache gets updated with the parsed reference content.
The "last modified" timestamp is obtained by comparing the new references data to the previously cached data.
Size doesn't seem to be an issue for the parser cache. FlaggedRevs' parser cache essentially duplicates the main parser cache on all articles where it's enabled, except those with an outdated stable version.

A new API is created which provides the parsed reference along with the "last modified" info, so clients can perform their own validation when viewing articles that may have changed in the meantime.

The strip markers issue (T127787), due to {{#tag:ref|...}} constructs with embedded <ref>, <nowiki>, etc., is also addressed by saving some additional data. This issue seems to have come up before, as the parser has functions for this: serializeHalfParsedText/unserializeHalfParsedText. It isn't perfect though (StripState::getSubState would need to be run recursively), but https://gerrit.wikimedia.org/r/#/c/277893/ should very simply solve the vast majority of those problems.
There are a few changes needed to core: https://gerrit.wikimedia.org/r/#/c/278609/ and its dependencies, plus one to FlaggedRevs: https://gerrit.wikimedia.org/r/#/c/277945/. I'll upload the patches for Cite.
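A sketch of how those Parser helpers could be used here (assumed usage only; the linked Gerrit changes may do this differently, and $parser / $refWikitext are placeholders):

```php
// When the references data is collected during the parse, keep the strip
// state needed to re-expand markers later.
$stored = $parser->serializeHalfParsedText( $refWikitext );
// ... persist $stored together with the rest of the references data ...

// Later, when a single reference is requested on demand:
$wikitext = $parser->unserializeHalfParsedText( $stored );
$html = $parser->recursiveTagParse( $wikitext );
```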

Change 278703 had a related patch set uploaded (by Cenarium):
[WIP] Use custom self-regenerating parser cache for references

https://gerrit.wikimedia.org/r/278703

Meeting with @Jdlrobson and @GWicke:

  • The JavaScript API request must pass along the current page view's revision id (wgRevisionId) so that the API can ensure it uses the right content in case the page has changed (it can try the main parser cache first and fallback to parsing on-demand). This will also automatically make it work when viewing an old revision from recent changes, user contributions, page history etc.
  • The SpecialPage fallback must contain the revision id in its url for the same reason (e.g. Special:References/<revisionId>#<refId>).
  • The main parser cache only holds HTML for "current" page revisions. If the page changed while viewing, and/or in the case of FlaggedRevs and when viewing older revisions, we may be able to leverage RESTBase to fetch stable HTML for any revision, regardless of whether it's the current revision or not. This would be a lot faster than re-parsing on demand. The catch here is that Parsoid (behind RESTBase) is not 1:1 compatible with how the legacy Parser renders pages. As such, unless Parsoid makes some modifications to allow us to match them across engines, we won't be able to use this for now. We'll need to stick to falling back to parsing on demand, perhaps behind a simple cache layer of sorts.
  • The JavaScript API request must pass along the current page view's revision id (wgRevisionId) so that the API can ensure it uses the right content in case the page has changed (it can try the main parser cache first and fallback to parsing on-demand). This will also automatically make it work when viewing an old revision from recent changes, user contributions, page history etc.

Passing the revision ID is not enough: references can be modified from pages transcluded within it. References defined within included pages or templates are actually frequent: more than 20,000 templates contain references, not to mention the countless project pages that rely on transcluded pages, where use of references is common. Hence I've passed the last-modified timestamp in the patch above.

  • The SpecialPage fallback must contain the revision id in its url for the same reason (e.g. Special:References/<revisionId>#<refId>).

For the same reason it should rather pass the (last modified) timestamp of the article as accessed and throw a warning if the reference was modified after it: Special:References&action=viewparsed&page=...&refid=...&timestamp=...

  • The main parser cache only holds HTML for "current" page revisions. If the page changed while viewing, and/or in the case of FlaggedRevs and when viewing older revisions, we may be able to leverage RESTBase to fetch stable HTML for any revision, regardless of whether it's the current revision or not. This would be a lot faster than re-parsing on demand. The catch here is that Parsoid (behind RESTBase) is not 1:1 compatible with how the old PHP Parser generates pages. As such, unless Parsoid makes some modifications to allow us to match them across engines, we won't be able to use this for now. We'll need to stick to falling back to parsing on demand, perhaps behind a simple cache layer of sorts.

FlaggedRevs has its own parser cache, and depending on configuration it can use the stable version of included pages/templates too, so it isn't necessarily the same as the standard parser cache even if the latest and stable revisions are in sync.
As for old revisions, it's too much complication for no real benefit; just don't lazy-load references on them, as it's useless to optimize there.
Now I see Parsoid mentioned as a solution, but it can't really be one: this must work for vanilla MediaWiki core + Cite (for section preview), and MediaWiki core + Cite + MobileFrontend for lazy loading. There are plenty of wikis with the Cite extension but not Parsoid, or with MobileFrontend and Cite but not Parsoid; it should never be a requirement for such a feature.

Izno moved this task from Unsorted backlog to Doing on the Cite board.

@aaron, @Krinkle: Can we use the stash cache for storing the half-parsed wikitext of references (for the duration of the parser cache)? (see https://gerrit.wikimedia.org/r/#/c/277445/13)

This feature will be removed (see T222373)

Change 277445 abandoned by Thiemo Kreuz (WMDE):
Store references data in stash cache

Reason:
The API is gone since I7088f25.

https://gerrit.wikimedia.org/r/277445