
Determine service infra for HTML dumps
Closed, Resolved (Public)

Description

Some questions so we can get this service up:

How long and how many dumps do we want to keep for starters?
What directory structure do we want?
What about metadata such as md5sums for downloaders?
Should there be an RSS feed announcing the latest complete run of a given wiki? Do we run them daily, weekly?
How often do we run from scratch and how often do we update?
Would we ever want to dump more than just ns0 and project/portal namespaces?

Event Timeline

ArielGlenn assigned this task to GWicke.
ArielGlenn raised the priority of this task from to Needs Triage.
ArielGlenn updated the task description.
ArielGlenn subscribed.

Would we ever want to dump more than just ns0 and project/portal namespaces?

Should this be configurable per wiki?

Maybe default to $wgContentNamespaces.

How long and how many dumps do we want to keep for starters?

I don't have strong feelings about this. The compressed dumps don't use up much space, so we could just keep them until we need to reclaim the disk space?
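A minimal sketch of the "keep until we need the space" policy, assuming runs are identified by a sortable date string and that their on-disk sizes are already known. The function name `runs_to_delete` and the `(run_date, size_bytes)` tuples are hypothetical and only illustrate the idea of dropping the oldest runs first:

```python
def runs_to_delete(runs, bytes_needed):
    """Pick the oldest dump runs to delete until enough space is reclaimed.

    runs: iterable of (run_date, size_bytes), where run_date sorts
          chronologically (e.g. "20150401"). Hypothetical input shape.
    bytes_needed: how much disk space has to be freed.
    """
    selected = []
    reclaimed = 0
    # Sorting the tuples orders them by run_date, oldest first.
    for run_date, size_bytes in sorted(runs):
        if reclaimed >= bytes_needed:
            break
        selected.append(run_date)
        reclaimed += size_bytes
    return selected
```

Nothing is deleted while `bytes_needed` is zero, which matches keeping everything until space is actually short.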

What directory structure do we want?

We definitely need a working dir outside the docroot. For the dumps, the main options are one directory per dump run (all wikis), or one directory per wiki / dump type (all dates). The former is easier to build, the latter might be more convenient for users.
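To make the two options concrete, here is a rough sketch of how a dump file's path would look under each layout. The root directory, the wiki name, and the `<wiki>-<date>-html.tar.gz` filename pattern are assumptions for illustration, not an agreed naming scheme:

```python
from pathlib import PurePosixPath


def dump_filename(wiki, run_date):
    # Hypothetical filename pattern; the task does not define one.
    return f"{wiki}-{run_date}-html.tar.gz"


def path_by_run(root, run_date, wiki):
    """Layout A: one directory per dump run, containing all wikis."""
    return PurePosixPath(root) / run_date / dump_filename(wiki, run_date)


def path_by_wiki(root, run_date, wiki):
    """Layout B: one directory per wiki, containing all run dates."""
    return PurePosixPath(root) / wiki / dump_filename(wiki, run_date)
```

Under layout A a user who wants every dump of one wiki has to visit each run directory, whereas under layout B they find all dates in a single place, which is the convenience trade-off mentioned above.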

What about metadata such as md5sums for downloaders?

We could have an md5sums.txt file in each directory.

Should there be an RSS feed announcing the latest complete run of a given wiki? Do we run them daily, weekly?

Based on the current run times (~10 hours for all wikis) and the ease of incremental updates, I'd propose weekly. ZIM generation will also take some time.

How often do we run from scratch and how often do we update?

Unless there are bugs or other reasons to reset, it'll always be faster to update an existing copy.

Would we ever want to dump more than just ns0 and project/portal namespaces?

I guess we could split by content (ns0) and non-content?

@ArielGlenn, is this task still useful, or is there another task tracking current work?

Resolving this one, as current work is tracked at T133547 per @ArielGlenn, and there isn't much life left here.