Force pages to be fully re-parsed occasionally
Open, Needs Triage · Public

Description

Caching is hard.

MediaWiki has $wgCacheEpoch and every row in the page database table has page.page_touched and page.page_links_updated timestamp columns.

For the Varnish cache, we force pages to be no older than 30 days because we don't want to serve stale content to users.

When a page is edited, we re-parse the page and update the relevant *links database tables.

However, the current approaches leave gaps:

  • The *links database table updates can be missed, due to job queue issues or flukes
  • Some pages aren't edited for many years, so they don't get re-parsed
  • Application code regularly gets updated, but pages then only get re-parsed "lazily", which in practice usually means only when they're next edited

Some of the data integrity issues we've been seeing are mentioned at T87716#2316414.

I would like to investigate using one of these timestamps we store as a means of forcing any page whose timestamp is more than X days old to be fully re-parsed and regenerated. With any other cache, we would have some kind of eviction policy. With the *links cache, we seem to currently rely on the assumption that incremental updates (e.g., from linked pages being created or deleted) and occasional edits to the page will keep everything in sync. However, on large and small wikis alike, there simply isn't enough edit activity. Or a bug gets introduced for a few months that prevents updates in certain cases. Or the job queue gets overloaded and jobs are manually deleted/de-duplicated in an emergency.
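
To make the idea concrete, here is a minimal sketch in Python of just the selection step. It is an illustration under assumptions, not a proposed implementation: it assumes the timestamp used is page.page_links_updated, stored in MediaWiki's 14-digit YYYYMMDDHHMMSS format and possibly NULL. The function names, the sample rows, and the 30-day threshold (standing in for X) are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# MediaWiki stores timestamps as 14-digit strings, e.g. "20160523034200" (UTC).
MW_TS_FORMAT = "%Y%m%d%H%M%S"


def parse_mw_timestamp(ts):
    """Convert a MediaWiki-style timestamp string to an aware UTC datetime."""
    return datetime.strptime(ts, MW_TS_FORMAT).replace(tzinfo=timezone.utc)


def needs_reparse(page_links_updated, now, max_age_days):
    """Hypothetical eviction rule: a page is stale if its links were never
    updated (NULL) or were last updated more than max_age_days ago."""
    if page_links_updated is None:
        return True
    age = now - parse_mw_timestamp(page_links_updated)
    return age > timedelta(days=max_age_days)


def select_stale_pages(rows, now, max_age_days=30):
    """Return the titles of pages that should be fully re-parsed.

    rows is an iterable of (title, page_links_updated) pairs, standing in
    for a query against the page table.
    """
    return [title for title, ts in rows if needs_reparse(ts, now, max_age_days)]


# Made-up sample data for illustration only.
rows = [
    ("Main_Page", "20160520120000"),    # updated three days before "now"
    ("Old_Article", "20090101000000"),  # untouched for years
    ("Never_Updated", None),            # NULL page_links_updated
]
now = datetime(2016, 5, 23, 3, 42, tzinfo=timezone.utc)
stale = select_stale_pages(rows, now, max_age_days=30)
print(stale)  # ['Old_Article', 'Never_Updated']
```

How the selected pages would actually be re-parsed (for example, by queueing jobs in small batches so the job queue isn't flooded) is deliberately left out of this sketch.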

A number of users have developed scripts, such as touch.py, to iterate through lists of pages and null-edit each of them. This is an effective, but hackish and silly, workaround that seems to be awfully discouraged. If we want to prevent null-edit scripts from being run so often, we need to find a way to make pages and their metadata less stale.

As we go forward, adding new data sources such as arbitrary Wikidata data to our pages, it will be even more important to make sure that we're serving relatively fresh content to users. Forcing the pages to be regenerated occasionally seems like an appropriate solution.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · May 23 2016, 3:42 AM

Yes please. This is a long-standing problem. Let me know if I can support this task in any way (testing or QA; I am not a developer).