
Performance and caching considerations for article placeholders accesses
Closed, Resolved · Public

Description

Article placeholders are displayed on a dynamic special page (Special:AboutTopic/Q\d+) that serves Wikidata's content in a human-readable way. This special page is currently linked from Special:Search (it appears in search results by default). It can also be accessed manually with an arbitrary item id. When a page linked to the given item already exists on a wiki, we redirect to the local page instead (so if there is already an article about Berlin on a wiki, the article placeholder for the Berlin item will redirect to that article).
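The routing rule in the paragraph above can be sketched as follows. This is a minimal illustration, not the extension's actual code: the function name `resolve_about_topic` and the `sitelinks` mapping are hypothetical stand-ins for the Wikibase sitelink lookup, and the item id is only an example.

```python
import re

# Item ids look like "Q" followed by digits, matching Special:AboutTopic/Q\d+.
ITEM_ID = re.compile(r"Q\d+")


def resolve_about_topic(item_id, wiki, sitelinks):
    """Decide what Special:AboutTopic/<item_id> should do on `wiki`.

    `sitelinks` is a hypothetical mapping {item_id: {wiki: local_title}}.
    """
    if not ITEM_ID.fullmatch(item_id):
        return ("error", "invalid item id")
    local_title = sitelinks.get(item_id, {}).get(wiki)
    if local_title is not None:
        # An article already exists locally: redirect instead of rendering.
        return ("redirect", local_title)
    # No local article: render the placeholder from Wikidata.
    return ("placeholder", item_id)


example_sitelinks = {"Q64": {"cywiki": "Berlin"}}
print(resolve_about_topic("Q64", "cywiki", example_sitelinks))
print(resolve_about_topic("Q64", "eowiki", example_sitelinks))
```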

Special:AboutTopic basically just parses a Wikitext snippet which transcludes a template which then invokes a Scribunto/Lua module. That module does most of the heavy lifting, by using Wikibase's functionality for Wikidata data access.

  • It's planned to have these placeholders indexed by search engines (only notable ones: 3,451,555[0], as of 2016-08-14, minus the ones where articles already exist at the individual wiki). This means that we will probably get occasional requests to all of these pages on all wikis with AP enabled.
  • It's also planned to have article placeholders on more wikis and also on wikis with more traffic. The placeholders need to handle the additional traffic this will cause.

In order to be able to implement the changes suggested above, we will need to find a strategy for caching article placeholders and probably also to invalidate that caching in case something relevant on Wikidata changes.
The easiest solution to this would be to implement T109458: [Story] CDN cache article placeholders, which suggests caching the placeholders for a certain amount of time, without any invalidation strategy.

As a limited trial, we could make placeholders indexable on a single wiki only, or maybe even a subset of the notable placeholders on one or two wikis.

Article placeholders are currently (as of 2016-08-14) enabled on a few low-traffic wikis: cywiki, eowiki, guwiki, htwiki, knwiki, lvwiki, napwiki, nnwiki, orwiki, testwiki, test2wiki, testwikidatawiki (wmgUseArticlePlaceholder). Current traffic statistics for the placeholders can be found on Grafana.

[0]: On wikidata: SELECT COUNT(*) FROM wb_entity_per_page INNER JOIN page_props AS pp1 ON pp1.pp_page = epp_page_id AND pp1.pp_propname = 'wb-claims' INNER JOIN page_props AS pp2 ON pp2.pp_page = epp_page_id AND pp2.pp_propname = 'wb-sitelinks' WHERE epp_redirect_target IS NULL AND pp1.pp_value > 2 AND pp2.pp_value > 2;
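Read as a predicate, the filter in query [0] amounts to the following. This is a sketch under the assumption that the `wb-claims` and `wb-sitelinks` page props hold per-item counts, as the comparisons in the query suggest; the function name `is_notable` is hypothetical.

```python
def is_notable(claims, sitelinks, is_redirect=False):
    """Mirror the WHERE clause of query [0].

    An item counts as notable when it is not a redirect and has more than
    two claims and more than two sitelinks.
    """
    return (not is_redirect) and claims > 2 and sitelinks > 2


print(is_notable(claims=5, sitelinks=3))   # passes both thresholds
print(is_notable(claims=5, sitelinks=2))   # too few sitelinks
```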

Event Timeline

hoo created this task. Aug 14 2016, 6:04 PM
Restricted Application added a subscriber: Aklapper. Aug 14 2016, 6:04 PM
BBlack added a subscriber: BBlack. Aug 16 2016, 1:16 PM

30 minutes isn't really reasonable, and neither is spamming more purge traffic. If there's a constant risk of the page content breaking without invalidation, how is even 30 minutes acceptable? Doesn't this mean that on average they'll be broken for 15 minutes after an affecting change?
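The "15 minutes on average" figure follows from a simple model, sketched here. It assumes the page is always in cache, is refetched the moment its entry expires, and that the affecting change lands at a uniformly distributed point within the TTL window. The function name `mean_staleness` is only illustrative.

```python
def mean_staleness(ttl_minutes, samples=10_000):
    """Average time a cached copy stays stale after a change, for a fixed TTL."""
    total = 0.0
    for i in range(samples):
        # Midpoint sampling of the change time within one TTL window.
        change_at = (i + 0.5) * ttl_minutes / samples
        # The stale copy is served from the change until the entry expires.
        total += ttl_minutes - change_at
    return total / samples


print(mean_staleness(30))        # about 15 minutes
print(mean_staleness(24 * 60))   # about 720 minutes, i.e. 12 hours
```

Under the same assumptions the worst case is the full TTL, which is the upper bound on staleness that matters for the discussion.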

hoo added a comment. Aug 17 2016, 1:57 PM

30 minutes isn't really reasonable, and neither is spamming more purge traffic. If there's a constant risk of the page content breaking without invalidation, how is even 30 minutes acceptable? Doesn't this mean that on average they'll be broken for 15 minutes after an affecting change?

Well, I have no idea how often they change in a significant way, but it's probably days rather than minutes.

As you can see on https://kn.wikipedia.org/wiki/%E0%B2%B5%E0%B2%BF%E0%B2%B6%E0%B3%87%E0%B2%B7:AboutTopic/Q12345 that page includes information from a lot of Wikidata entities (mostly labels of these). In theory a change to any of them would require us to purge that page, but that's hardly implementable and will lead to a lot of purge traffic (especially given we have the placeholders on various wikis).

During Wikimania we (@Joe and I) thought about how bad it would be to just cache them for n hours. We could easily provide a ?action=purge on the special page (would be done on the MediaWiki side of things), so that people can manually purge them after editing, if they deem that to be necessary. Does that sound reasonable?
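The "cache for n hours, with a manual purge" idea can be sketched as a small in-memory cache. This only illustrates the behaviour and is not how the CDN or MediaWiki implements it; the class name `TTLCache`, its methods, and the `render` callback are hypothetical.

```python
import time


class TTLCache:
    """Cache rendered pages for a fixed time, with an explicit purge hook."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the behaviour can be tested
        self._entries = {}          # key -> (expires_at, value)

    def get(self, key, render):
        """Return the cached page, re-rendering it if missing or expired."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = render(key)
        self._entries[key] = (now + self.ttl, value)
        return value

    def purge(self, key):
        """Manual invalidation, analogous to requesting ?action=purge."""
        self._entries.pop(key, None)


cache = TTLCache(ttl_seconds=6 * 3600)
page = cache.get("Q12345", lambda item_id: f"rendered placeholder for {item_id}")
cache.purge("Q12345")   # the next get() re-renders from current data
```

With no automatic invalidation, an unpurged page can be stale for up to the full TTL, so the choice of n bounds how long vandalism or a broken module stays visible.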

I think I'm lacking a lot of context here about these special pages and placeholders. But my bottom line thoughts are currently along these lines:

  1. How do actual, real-world, anonymous users interact with these placeholders and special pages? What value is it providing the average reader, in what way? How does the scope of the new code and new invalidation problems (esp potential purge traffic) compare to that? Because I tend to think (with what little context I have) that this sounds like a ton of churn on our end for very little real value to the user. Maybe most of the value is to logged-in editors, who don't face invalidation problems in the first place?
  2. For the most part, we can categorize the invalidation model of page content into one of two bins: either it's purged on relevant update nearly-immediately (at most, a few seconds' delay for asynchronicity and such), or it's something that sometimes goes stale for some real amount of time, where we really have to think about what happens when users read a stale page, and we need an upper bound on staleness to consider that question properly. Once you're in the latter bin of stale things, there needs to be a rational way to quantify the fallout of a stale view. Is a stale page broken itself, or does it have broken links, or simply outdated content? I tend to think that, in the examples I've seen so far, either something requires immediate invalidation, or staleness isn't a real issue within a reasonable (e.g. hours, days) timeframe. 30 minutes seems arbitrary and probably not tied to a real-world constraint on how broken a stale view is. It sounds more like a compromise because we really want immediate purging but we know the purge volume will be unreasonable.
hoo added a comment. Aug 17 2016, 3:04 PM

I think I'm lacking a lot of context here about these special pages and placeholders. But my bottom line thoughts are currently along these lines:

  1. How do actual, real-world, anonymous users interact with these placeholders and special pages? What value is it providing the average reader, in what way? How does the scope of the new code and new invalidation problems (esp potential purge traffic) compare to that? Because I tend to think (with what little context I have) that this sounds like a ton of churn on our end for very little real value to the user. Maybe most of the value is to logged-in editors, who don't face invalidation problems in the first place?

ArticlePlaceholders have the most value for readers that want information about a certain topic. The idea is to add value to (primarily) small Wikipedias so that people see (and use) them more, which in turn, will hopefully gain us new editors.
For this, we also want to get the placeholders into search engines, to drive more traffic to those small Wikipedias (which don't have much content of their own, and thus get little search engine traffic).

  2. For the most part, we can categorize the invalidation model of page content into one of two bins: either it's purged on relevant update nearly-immediately (at most, a few seconds' delay for asynchronicity and such), or it's something that sometimes goes stale for some real amount of time, where we really have to think about what happens when users read a stale page, and we need an upper bound on staleness to consider that question properly. Once you're in the latter bin of stale things, there needs to be a rational way to quantify the fallout of a stale view. Is a stale page broken itself, or does it have broken links, or simply outdated content? I tend to think that, in the examples I've seen so far, either something requires immediate invalidation, or staleness isn't a real issue within a reasonable (e.g. hours, days) timeframe. 30 minutes seems arbitrary and probably not tied to a real-world constraint on how broken a stale view is. It sounds more like a compromise because we really want immediate purging but we know the purge volume will be unreasonable.

Yeah, in a perfect world we would (obviously) always serve the most up-to-date content from Wikidata.
I think it's ok for these pages to get a little outdated; relevant links don't change that often (most probably go to Wikidata, which has stable ids, anyway). The only issue I can see with a longer cache duration is vandalism and broken pages (for example, if a Wikipedia messes up their Lua module so that the special page produces garbage) staying in the caches for too long.

For this, we also want to get the placeholders into search engines, to drive more traffic to those small Wikipedias (which don't have much content of their own, and thus get little search engine traffic).

May I ask how this should work? As far as I understand, your special page generates an “article” from Wikidata data if there is no real article. So if, for example, there were no article about “Horse” on English Wikipedia and someone searched for “Horse”, your special page would generate it.
But how can you feed that to an external search engine? For that, the search engine would need to ask for a page that is not there. Why should it do so, and for which articles should it ask?

hoo added a comment. Aug 19 2016, 12:02 AM

For this, we also want to get the placeholders into search engines, to drive more traffic to those small Wikipedias (which don't have much content of their own, and thus get little search engine traffic).

May I ask how this should work? As far as I understand, your special page generates an “article” from Wikidata data if there is no real article. So if, for example, there were no article about “Horse” on English Wikipedia and someone searched for “Horse”, your special page would generate it.
But how can you feed that to an external search engine? For that, the search engine would need to ask for a page that is not there. Why should it do so, and for which articles should it ask?

Article placeholder doesn't redirect from the article namespace, but only works if linked to directly (for example on Special:Search or externally). We plan to feed a list of notable placeholders to search engines (possibly via an indexed special page).

hoo added a comment. Sep 2 2016, 12:44 PM

I've created T144592: Search index a limited number of article placeholders on cywiki for testing and evaluation purposes for a trial of submitting placeholders to search engines. Based on that, we can (probably) do further performance evaluation.

ema moved this task from Triage to Caching on the Traffic board. Sep 30 2016, 2:37 PM
elukey triaged this task as Normal priority. Oct 20 2016, 1:32 PM
hoo added a comment. Nov 8 2016, 9:18 AM

Heads up: In T144592: Search index a limited number of article placeholders on cywiki for testing and evaluation purposes we decided to index exactly 1,000 placeholders on eowiki.

All other placeholders will not be linked to and will also have <meta name="robots" content="noindex,nofollow"/> set, so this is a very limited trial.
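The trial's indexing rule can be sketched as below. This is an illustration only: the function name `robots_meta` and the `indexable_ids` allowlist are hypothetical, and it assumes that indexable placeholders simply omit the tag and rely on the crawler's default behaviour.

```python
NOINDEX_TAG = '<meta name="robots" content="noindex,nofollow"/>'


def robots_meta(item_id, indexable_ids):
    """Return the robots meta tag for a placeholder, or "" if it may be indexed.

    `indexable_ids` stands in for the list of placeholders chosen for the trial.
    """
    if item_id in indexable_ids:
        return ""           # no tag: crawlers fall back to their default policy
    return NOINDEX_TAG


trial_ids = {"Q12345"}      # the real trial list has 1,000 entries
print(repr(robots_meta("Q12345", trial_ids)))
print(repr(robots_meta("Q64", trial_ids)))
```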

We hope to incrementally increase the number of placeholders indexed later on, depending on the findings from this trial.

Nothing was ever resolved here. 30 minutes seems like an arbitrary number with no formal basis or reasoning, and is way shorter than we'd like for anything article-like.

I clicked Submit too soon :) Continuing:

We'd expect content to be cacheable for at minimum a day, if not significantly longer. MW currently emits 2-week cache headers (with plans to eventually bring that down closer to a day, but those plans are still further off). Cache invalidation is a hard problem, but it's not something we can just ignore, either. Perhaps this should be tied into the broader X-Key effort to sweep these up when the underlying Wikidata data is updated?
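Key-based purging of the kind suggested here can be sketched as follows, assuming "X-Key" means tagging each cached response with the ids of the Wikidata entities it depends on, so that one purge per changed entity sweeps up every dependent page. The class `KeyedCache` and its methods are hypothetical and do not reflect any real cache's API.

```python
from collections import defaultdict


class KeyedCache:
    """Cache whose entries can be purged by the keys they were tagged with."""

    def __init__(self):
        self._pages = {}                       # url -> body
        self._urls_by_key = defaultdict(set)   # key -> urls tagged with it

    def store(self, url, body, keys):
        """Cache `body` under `url`, tagged with every entity id in `keys`."""
        self._pages[url] = body
        for key in keys:
            self._urls_by_key[key].add(url)

    def has(self, url):
        return url in self._pages

    def purge_key(self, key):
        """Drop every cached page tagged with `key`; return how many were dropped."""
        removed = 0
        for url in self._urls_by_key.pop(key, set()):
            if url in self._pages:
                del self._pages[url]
                removed += 1
        return removed


cache = KeyedCache()
# A placeholder depends on its own item and on items whose labels it shows.
cache.store("kn:AboutTopic/Q12345", "<html>...</html>", keys={"Q12345", "Q5"})
cache.store("cy:AboutTopic/Q12345", "<html>...</html>", keys={"Q12345", "Q5"})
cache.purge_key("Q5")   # one purge per changed entity, regardless of page count
```

The appeal is that purge traffic scales with the number of changed entities rather than with the number of wikis and pages that display them.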

hoo added a comment. Nov 9 2016, 4:00 PM

@BBlack Given that T109458: [Story] CDN cache article placeholders is not implemented, there is no caching for these pages right now. If you consider this a requirement for this limited trial, we could look into caching them for one day (and provide a custom purge mechanism).

Hey :)

We'd really like to move forward with making the ArticlePlaceholder more useful. It not showing up in search engine results is the biggest blocker at the moment.
Given that there was no feedback on Marius' last comment, can we go ahead? If we don't get a reply in the next two weeks, I'll assume yes.

hoo closed this task as Resolved. Jan 10 2017, 5:39 PM
hoo claimed this task.

We discussed this at the developer summit with @BBlack and we decided to go for 24h edge caching.
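In terms of response headers, 24h edge caching could look roughly like this. It is a sketch, not the deployed configuration: the helper name `cache_control` is hypothetical, and pairing `s-maxage` with `max-age=0` (so that only shared caches hold the page) is an assumption.

```python
EDGE_TTL_SECONDS = 24 * 60 * 60   # the 24h agreed on above


def cache_control(edge_ttl=EDGE_TTL_SECONDS):
    """Build a Cache-Control value that lets shared caches keep the page.

    s-maxage applies to shared (edge) caches only; max-age=0 makes browsers
    revalidate instead of reusing their own copy.
    """
    return f"public, s-maxage={edge_ttl}, max-age=0"


print("Cache-Control:", cache_control())
```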

hoo moved this task from Incoming to Done on the ArticlePlaceholder board. Jan 31 2017, 2:23 PM