Declare opted-in abstract articles indexable by external search engines
Open, Medium, Public

Description

Declare opted-in abstract articles indexable by external search engines.

External search engines (Google, Bing, and the long tail of smaller crawlers) are how most readers discover mainspace content on WMF wikis. To a reader searching externally, an opted-in abstract article that is invisible to those crawlers is indistinguishable from an article that doesn't exist. For the Q1 pilot to evaluate whether the content experience works in practice, readers have to be able to reach these articles the same way they reach any other mainspace article, and on WMF wikis that way is predominantly an external search engine. Declaring the opted-in articles indexable is therefore the minimum condition for the pilot to be a fair test of the model; it is not an aggressive SEO push but parity with normal mainspace content.

That parity is a hard constraint: it rules out anything that would give opted-in articles an edge over locally authored articles in external search. That means no selective noindex (applied to some pages and not others), no sitemap priority boosting, no schema.org structured-data markup beyond what the wiki's skin already emits for every article (except the isBasedOn provenance linkage explored below, if the exploration concludes it fits), and no submission to webmaster-tools consoles. The bar is strictly "these pages behave like normal mainspace articles from a crawler's perspective", and any behaviour that departs from it is out of scope for this bullet.
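As a rough illustration of how the "no selective noindex" part of the parity bar could be spot-checked, the sketch below compares the robots meta directives of an opted-in page against those of an ordinary mainspace page. The function and class names are hypothetical, and it assumes the rendered HTML of both pages has already been fetched; it does not inspect `X-Robots-Tag` response headers, which a fuller check would also need to cover.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, follow".
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def parity_diff(opted_in_html, control_html):
    """Report directives present on one page but not the other (hypothetical helper)."""
    opted_in = robots_directives(opted_in_html)
    control = robots_directives(control_html)
    return {
        "only_opted_in": sorted(opted_in - control),
        "only_control": sorted(control - opted_in),
    }
```

Under the parity bar, both lists would be expected to be empty when an opted-in article is compared with a locally authored article on the same wiki.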

The page-level metadata (robots, canonical, hreflang) is already emitted by the mainspace rendering sub-bullet. This bullet covers the remaining outbound surface: sitemap inclusion, an optional machine-readable provenance linkage, and the community-facing framing of why we are making these pages crawlable in the first place.
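To make the optional provenance linkage more concrete, here is a minimal sketch of what a schema.org `isBasedOn` declaration could look like if the exploration concludes that it fits. This is an illustration only, not a decided format: the helper name, the choice of JSON-LD as the carrier, the `Article` type, and the example URLs are all assumptions, and whether such a property would be merged into the skin's existing structured data or emitted separately is left open.

```python
import json


def provenance_jsonld(page_url, source_url):
    """Build a JSON-LD <script> element linking a page to its source (hypothetical helper)."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",         # assumption: the page is described as an Article
        "url": page_url,
        "isBasedOn": source_url,    # schema.org property on CreativeWork; accepts a URL
    }
    # Escape "<" so the JSON payload cannot terminate the surrounding <script> element.
    payload = json.dumps(data, sort_keys=True).replace("<", "\\u003c")
    return '<script type="application/ld+json">' + payload + "</script>"


if __name__ == "__main__":
    # Both URLs are placeholders for illustration.
    print(provenance_jsonld(
        "https://test.wikipedia.org/wiki/Example",
        "https://example.org/abstract-source/Example",
    ))
```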

Acceptance criteria:

  • Opted-in abstract articles appear in the standard sitemap generated by the wiki's sitemap generation path (via an extension point if one already exists, or via a newly-added hook or patch if one does not), verified by inspecting a generated sitemap on Test Wikipedia.
  • Sitemap entries for opted-in articles reflect opt-in set changes within the normal sitemap regeneration cadence, with no bespoke faster-invalidation path added.
  • The machine-readable provenance header exploration has been conducted and its outcome is recorded: either a standard header (schema.org isBasedOn, rel="via", dcterms:source, or similar) is being emitted alongside page-level metadata, or the exploration is documented as concluding nothing fits and no bespoke header is invented.
  • No behaviour departs from "these pages behave like normal mainspace articles from a crawler's perspective": no selective noindex, no sitemap priority boosting, no schema.org markup beyond the skin's default output plus the provenance header above (if emitted), no webmaster-tools submission.
  • The community-facing justification for indexing by default (parity with normal mainspace, and discoverability without M3) is recorded in the Phabricator task description so that it travels with the work.
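
The sitemap criteria above call for inspecting a generated sitemap on Test Wikipedia. One way that inspection could be partly automated is sketched below, assuming the sitemap follows the sitemaps.org `urlset` format and that the list of opted-in article URLs is available from elsewhere. The function names are hypothetical; the check reports opted-in URLs that are missing from the sitemap and any opted-in URL whose `<priority>` exceeds the highest priority given to a non-opted-in entry.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
DEFAULT_PRIORITY = 0.5  # sitemaps.org default when <priority> is absent


def parse_sitemap(xml_text):
    """Return {url: priority} for every <url> entry in a urlset document."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        prio = url_el.findtext("sm:priority", namespaces=SITEMAP_NS)
        entries[loc.strip()] = float(prio) if prio is not None else DEFAULT_PRIORITY
    return entries


def check_opted_in(xml_text, opted_in_urls):
    """Return (missing, boosted) lists of opted-in URLs (hypothetical helper).

    missing: opted-in URLs that do not appear in the sitemap.
    boosted: opted-in URLs whose priority is higher than any ordinary entry's.
    """
    entries = parse_sitemap(xml_text)
    opted_in = set(opted_in_urls)
    present = set(entries)
    missing = sorted(opted_in - present)
    baseline = [prio for url, prio in entries.items() if url not in opted_in]
    ceiling = max(baseline) if baseline else DEFAULT_PRIORITY
    boosted = sorted(url for url in opted_in & present if entries[url] > ceiling)
    return missing, boosted
```

Both returned lists being empty would be consistent with the inclusion and no-priority-boosting criteria. The check says nothing about regeneration cadence, which would have to be observed across successive sitemap runs.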