Investigate what it takes to add a __EXPECTED_UNCONNECTED_PAGE__ magic word to exclude pages from Special:UnconnectedPages
Closed, Resolved, Public

Description

Before proceeding to add a new magic word, let's list steps and work needed to add it here.
This will serve both as 1) documentation of adding a Wikidata/Wikibase-specific magic word, and 2) data to use to evaluate if we actually are willing to implement the magic-word-based approach.

Acceptance criteria:

  • List of changes to the Wikibase code base that would be required to add a magic word

Hints:

  • There are already existing Wikibase-related magic words that could be looked into as inspiration/help: noexternallanglinks and wbreponame

Notes:

  • noexternallanglinks is a magic word that affects the page it is used on. The situation here is different: the magic word is intended to affect Special:UnconnectedPages. In other words, the questions to answer during the investigation are:
    • Would the special page logic need to parse the wikitext of every single page in order to figure out which pages to list? (That would not be desirable performance-wise.)
    • Are there alternatives/features that could be used here (e.g. page props)?
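
To illustrate the page props alternative raised in the notes above, here is a minimal sketch of how a special page could filter on stored properties instead of parsing wikitext. It uses an in-memory SQLite database with a simplified version of the page and page_props tables. The prop name expected_unconnected_page is a hypothetical placeholder (the task does not fix one), and the query is only an assumption about what such a lookup might look like, not the actual Special:UnconnectedPages implementation.

```python
import sqlite3

# Hypothetical prop name, chosen only for this illustration.
EXPECTED_UNCONNECTED_PROP = "expected_unconnected_page"


def build_demo_db():
    """Create a tiny in-memory database with simplified page/page_props tables."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);
        """
    )
    db.executemany(
        "INSERT INTO page VALUES (?, ?)",
        [(1, "Berlin"), (2, "Archived_discussion_1"), (3, "New_stub")],
    )
    db.executemany(
        "INSERT INTO page_props VALUES (?, ?, ?)",
        [
            (1, "wikibase_item", "Q64"),           # connected page
            (2, EXPECTED_UNCONNECTED_PROP, "1"),   # marked via the magic word
        ],
    )
    return db


def unconnected_pages(db):
    """List pages that have neither a linked item nor the 'expected' marker."""
    rows = db.execute(
        """
        SELECT page_title FROM page
        WHERE NOT EXISTS (
            SELECT 1 FROM page_props
            WHERE pp_page = page_id
              AND pp_propname IN ('wikibase_item', ?)
        )
        ORDER BY page_title
        """,
        (EXPECTED_UNCONNECTED_PROP,),
    )
    return [title for (title,) in rows]
```

In this sketch no wikitext is read at query time: the marker is assumed to have been written when the page was last parsed, so the special page only needs a table lookup.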

Event Timeline

Maybe we should see how many pages might be affected by this (as an upper bound, count all pages that currently aren't linked to Wikidata). With that number we will be able to make a reasonable decision.

If not terribly many pages are affected by this, page_props are the way to go (or, if almost all unlinked pages will have it, in a negated form, that indicates the page is unconnected, but shouldn't be).

I quite like EXPECTED_UNCONNECTED_PAGE, mostly because it can be transcluded in templates and fix users' workflows at large scale (e.g. marking all pages that use the "Archived discussion" template at once, instead of going to each archived discussion one by one and marking it as an expected unconnected page). I don't know if there's any other way to do it properly.
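
A toy sketch of why transclusion gives this large-scale effect: if the magic word sits in a template, every page that uses the template ends up with the marker once it is expanded. The expansion below is a deliberately naive stand-in for the MediaWiki parser (one level, no parameters), and the prop name expected_unconnected_page is a hypothetical placeholder.

```python
import re

MAGIC_WORD = "__EXPECTED_UNCONNECTED_PAGE__"

# Hypothetical template store: template name -> template wikitext.
TEMPLATES = {
    "Archived discussion": "This discussion is archived. " + MAGIC_WORD,
}


def expand(wikitext, templates):
    """Naive transclusion: replace {{Name}} with the template text (one level only)."""
    def replace(match):
        name = match.group(1).strip()
        # Unknown templates are left untouched.
        return templates.get(name, match.group(0))

    return re.sub(r"\{\{([^{}|]+)\}\}", replace, wikitext)


def page_props_for(wikitext, templates):
    """Return the props that would be recorded for a page, under this toy model."""
    expanded = expand(wikitext, templates)
    if MAGIC_WORD in expanded:
        return {"expected_unconnected_page": "1"}
    return {}
```

Under these assumptions, a page containing only `{{Archived discussion}}` gets the marker without being edited itself, while a page without the template or the magic word gets no prop.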

If not terribly many pages are affected by this, page_props are the way to go (or, if almost all unlinked pages will have it, in a negated form, that indicates the page is unconnected, but shouldn't be).

I would assume most unlinked pages would eventually have the magic word – the expected state is probably that almost all pages have either a linked item or the magic word, and pages that have neither are the remaining ones that need to be cleaned up. So from that perspective, it would probably be more efficient to add a page prop when the magic word isn’t present.

However, that would also make the migration more difficult. If we add the page prop when the magic word is present, then we can deploy the code change that introduces the magic word, without any changes to the page_props table. If we want to add the page prop when the magic word isn’t present, then during the initial rollout we would need to add the page prop to all unlinked pages. (Or have an intermediate stage where we always write the page prop, with value 0 or 1 depending on whether the magic word is there or not, and then once all unconnected pages have the page prop, we can change the default interpretation and stop writing one of the two values…)
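
The intermediate stage described above could be sketched roughly as follows. This is only one possible reading of the idea: the stage names, the "0"/"1" encoding, and the choice to treat a missing prop as "not expected" during the intermediate stage are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    INTERMEDIATE = 1  # always write "1" (magic word present) or "0" (absent)
    FINAL = 2         # only write "0"; a missing prop now means "expected"


def value_to_write(stage: Stage, has_magic_word: bool) -> Optional[str]:
    """Prop value to store for an unconnected page on parse (None = write nothing)."""
    if stage is Stage.INTERMEDIATE:
        return "1" if has_magic_word else "0"
    # FINAL: stop writing the value assumed to be the common one.
    return None if has_magic_word else "0"


def is_expected_unconnected(stage: Stage, prop_value: Optional[str]) -> bool:
    """Interpret a stored prop value (None = no prop row) for the given stage."""
    if prop_value is None:
        # The default interpretation flips once every page has been reparsed.
        return stage is Stage.FINAL
    return prop_value == "1"
```

The switch from INTERMEDIATE to FINAL is only safe once every unconnected page has been reparsed and carries one of the two values, which is the migration cost weighed in the comment above.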

I feel like optimizing the page prop is probably not worth the effort (after all, connected pages almost certainly outnumber unconnected pages, whether expected or unexpected, as long as we don’t count excluded namespaces, and we write a page prop for each connected page), and we should just write a page prop whenever we encounter the magic word. (If that becomes a problem, we can still change it later.)

Exactly. I want to emphasize that page_props is not a scalability/performance concern. ALL of page_props (with wikidata, etc.) on enwiki is just 2.8GB; it's the 25th biggest table. For comparison, the output of the autoreview logs of flaggedrevs on ruwiki over just two years is bigger than all of page_props on enwiki. On commons it's a bit bigger, but it's still only the 17th biggest table there. In other numbers, it's only 0.7% of the commons database and 0.3% of enwiki's database.

Even if page_props becomes big enough to be a scalability concern, it's rather easily fixable by normalizing it with a NameTableStore.
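
As a rough sketch of what such a normalization means: the repeated prop name strings move into a small lookup table, and each prop row stores only an integer id. The code below imitates that idea with SQLite. It is not the MediaWiki NameTableStore API, and the table and column names (prop_names, page_props_normalized, pp_propname_id) are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a name table and a normalized props table."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE prop_names (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE page_props_normalized (
            pp_page INTEGER,
            pp_propname_id INTEGER REFERENCES prop_names(id),
            pp_value TEXT
        );
        """
    )
    return db


def acquire_name_id(db, name):
    """Return the id for a prop name, inserting the name on first use."""
    row = db.execute("SELECT id FROM prop_names WHERE name = ?", (name,)).fetchone()
    if row is not None:
        return row[0]
    return db.execute("INSERT INTO prop_names (name) VALUES (?)", (name,)).lastrowid


def set_prop(db, page_id, name, value):
    """Store a prop row that references the name by id instead of by string."""
    db.execute(
        "INSERT INTO page_props_normalized VALUES (?, ?, ?)",
        (page_id, acquire_name_id(db, name), value),
    )


def get_props(db, page_id):
    """Read a page's props back as a name -> value mapping."""
    rows = db.execute(
        """
        SELECT n.name, p.pp_value
        FROM page_props_normalized AS p
        JOIN prop_names AS n ON n.id = p.pp_propname_id
        WHERE p.pp_page = ?
        """,
        (page_id,),
    )
    return dict(rows.fetchall())
```

Each distinct prop name is stored once, so the per-row cost drops from a string to an integer, at the price of a join when reading.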